On Thu, Dec 07, 2006 at 09:55:19AM -0500, Rici Lake wrote:
> Yes, I agree with that completely. It would have been better
> to use native-endian UTF-16 as an internal representation, and
> UTF-8 as a transfer encoding, which I believe is what Unicode
> Consortium recommends. UTF-16 uses a maximum of 4 bytes to
> represent any code point, but the vast majority of code points
> actually used fit into 2 bytes.
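
(To make that size claim concrete, here's a quick C sketch of my own;
the function name utf16_encode is just illustrative, not an existing
API.  Codepoints below U+10000 encode as one 16-bit unit, everything
up to U+10FFFF as a surrogate pair of two units.)

#include <stdio.h>
#include <stdint.h>

/* Encode one codepoint as UTF-16; returns the number of 16-bit units
 * written (1 or 2), or 0 if cp is not a valid codepoint. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* bare surrogate: invalid */
        out[0] = (uint16_t)cp;
        return 1;                            /* 2 bytes */
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;                                /* 4 bytes */
}

int main(void)
{
    uint16_t u[2];
    printf("U+0041  -> %d unit(s)\n", utf16_encode(0x0041, u));  /* 1 */
    printf("U+4E2D  -> %d unit(s)\n", utf16_encode(0x4E2D, u));  /* 1 */
    printf("U+1D11E -> %d unit(s)\n", utf16_encode(0x1D11E, u)); /* 2 */
    return 0;
}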

UTF-16 is terrible.  It combines the annoyances inherent in any Unicode
representation (combining characters mean one glyph can be represented
by several codepoints); with the annoyances of a wide representation
(incompatible with regular C strings; if it becomes desynchronized,
e.g. due to a data error, it'll never resync); and with the annoyances
of a multibyte representation (a single codepoint can take a variable
number of data elements, so there's no random access to codepoints).
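
(To illustrate that last point with another sketch of my own, names
included: reaching the Nth codepoint in a UTF-16 buffer means scanning
from the start, since each codepoint is one or two 16-bit units.)

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Return the unit index of the n-th codepoint in a UTF-16 buffer of
 * `len` units, or (size_t)-1 if there aren't that many codepoints.
 * There is no O(1) answer: every preceding unit has to be inspected. */
static size_t utf16_codepoint_index(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;
    while (i < len) {
        if (n == 0)
            return i;
        /* A high surrogate followed by a low surrogate is one codepoint
         * spanning two units; everything else is one unit. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < len && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;
        else
            i += 1;
        n--;
    }
    return (size_t)-1;
}

int main(void)
{
    /* 'A' + U+1D11E (surrogate pair) + 'B': 4 units, 3 codepoints. */
    const uint16_t s[] = { 0x0041, 0xD834, 0xDD1E, 0x0042 };
    printf("codepoint 2 starts at unit %u\n",
           (unsigned)utf16_codepoint_index(s, 4, 2));  /* prints 3 */
    return 0;
}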

UTF-32 at least does away with the last of these: a single data element
(wchar_t) always represents a single codepoint.  That codepoint may not
represent the entire glyph, but that's a separate problem: in UTF-16 you
have to cope both with decoding codepoints and with combining multiple
codepoints into one glyph, which are different issues that cause
different problems.
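
(Again only a sketch of my own, with an arbitrary example string: in
UTF-32, indexing by codepoint is plain array indexing, but one glyph
can still be several codepoints, e.g. 'e' followed by U+0301 COMBINING
ACUTE ACCENT.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 'e' + U+0301 (combining acute): two codepoints, one glyph. */
    const uint32_t s[] = { 0x0065, 0x0301, 0 };
    size_t count = 0;

    /* Random access to codepoints is just array indexing... */
    printf("codepoint 1 is U+%04X\n", (unsigned)s[1]);

    /* ...but counting codepoints is still not counting glyphs. */
    for (size_t i = 0; s[i] != 0; i++)
        count++;
    printf("%u codepoints, 1 glyph\n", (unsigned)count);
    return 0;
}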

(I suspect that a lot of application-level UTF-16 code simply ignores
surrogate pairs, turning it into UCS-2, though.)
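
(That failure mode is easy to show with one more sketch of my own:
count one "character" per 16-bit unit, as UCS-2-minded code would, and
any string containing a non-BMP codepoint comes out wrong.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* U+1D11E (MUSICAL SYMBOL G CLEF) encoded as a surrogate pair. */
    const uint16_t s[] = { 0xD834, 0xDD1E, 0 };
    size_t units = 0;

    /* UCS-2-style "character count": one per 16-bit unit. */
    for (size_t i = 0; s[i] != 0; i++)
        units++;
    printf("%u units, but only 1 codepoint\n", (unsigned)units);
    return 0;
}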

-- 
Glenn Maynard