- Subject: Re: Plea for the support of unicode escape sequences
- From: Edgar Toernig <froese@...>
- Date: Tue, 28 Jun 2011 23:13:29 +0200
Florian Weimer wrote:
> * Edgar Toernig:
> > + case 'U': read_and_save_uniesc(ls, 8); continue;
> Shouldn't \U take only 6 digits, and only characters up to \U1fffff?
> Code points beyond that are no longer representable in the current
> version of UTF-8, after all.
The (computer) languages I know all use \u plus 4 hex digits and,
if they support non-BMP escapes, \U plus 8 hex digits.
And they don't deal with the complex Unicode standard per se but
only with the encoding of code points used by the standard.
What is representable and what is covered by some standard are
two different things. I.e. the _encoding method_ given in UTF-8
supports encoding of all 32 bits of the \U-escape (as does UTF-32)
but the Unicode standard covers only the range from 0 to 0x10ffff.
Which code-points are actually valid depends on context and on the
version number of the standard, and is thus hard to check ;-)
Point is: Any valid Unicode code-point, when given as \u/\U-escape,
produces a valid UTF-8 byte stream for now and the foreseeable future.
Sure, you _may_ give invalid code-points and produce invalid
bytes, but that's easy anyway ("\xff broken unicode").
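
To make the difference between "representable" and "covered by the standard" concrete, here is a minimal sketch in C. It is not the code from the patch: the names utf8_encode_raw and in_unicode_range are hypothetical, and it assumes the classic 1-to-6-byte UTF-8 layout, which carries up to 31 payload bits. Anything above 0x7fffffff would need a longer, non-standard form and is simply rejected here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: is cp inside the range the Unicode standard covers?
   This checks the range only, not whether cp is assigned or a surrogate. */
int in_unicode_range(uint32_t cp)
{
    return cp <= 0x10FFFF;
}

/* Hypothetical helper: encode cp with the classic UTF-8 bit layout.
   buf must have room for 6 bytes. Returns the number of bytes written,
   or 0 if cp does not fit into 31 bits. No validity check is done. */
size_t utf8_encode_raw(uint32_t cp, unsigned char *buf)
{
    size_t len, i;

    if (cp < 0x80) {            /* plain ASCII, one byte */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800)           len = 2;
    else if (cp < 0x10000)    len = 3;
    else if (cp < 0x200000)   len = 4;
    else if (cp < 0x4000000)  len = 5;
    else if (cp < 0x80000000) len = 6;
    else return 0;

    /* continuation bytes (10xxxxxx), filled from the last one backwards */
    for (i = len - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* lead byte: len high bits set, a zero bit, then the remaining payload */
    buf[0] = (unsigned char)((0xFF << (8 - len)) | cp);
    return len;
}
```

With this sketch, U+20AC comes out as the three bytes E2 82 AC and U+10FFFF as F4 8F BF BF. A value such as 0x7fffffff still encodes (to six bytes), even though in_unicode_range() says the standard does not cover it.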