On 7-Dec-06, at 4:56 PM, Glenn Maynard wrote:

UTF-32 at least does away with the last: a single data element (wchar_t)
always represents a single codepoint.  That codepoint may not represent
the entire glyph, but that's a separate problem--in UTF-16, you have
to cope with both decoding codepoints, and combining multiple codepoints
into one glyph, which are different issues causing different problems.

Actually, I think you could solve both of those problems with the same code. And you're not out of the woods with UTF-32 either, as far as decoding goes, unless you skip validation entirely: in UTF-32 the surrogate codes are illegal, as are codes >= 2^20 + 2^16 (i.e. anything beyond U+10FFFF).
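
For illustration only -- a minimal UTF-32 validity check along those lines, in C; the function name is my own, not from any particular library:

    #include <stdint.h>
    #include <stdbool.h>

    /* A 32-bit code unit is a valid Unicode scalar value only if it is
     * not a surrogate (U+D800..U+DFFF) and does not exceed U+10FFFF,
     * i.e. 2^20 + 2^16 - 1. */
    static bool utf32_is_valid(uint32_t c)
    {
        if (c >= 0xD800 && c <= 0xDFFF)
            return false;
        if (c > 0x10FFFF)
            return false;
        return true;
    }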

(I suspect that a lot of application-level UTF-16 code simply ignores
surrogate pairs, turning it into UCS-2, though.)

Yes. When such code is combined with more modern libraries, it can cause ugly things to happen -- I think that is why some ncurses installations crash when given characters outside of the BMP.
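
To make the failure mode concrete: a surrogate pair has to be folded back into one code point before it means anything. A sketch in C, again with a function name of my own choosing:

    #include <stdint.h>

    /* high must be in U+D800..U+DBFF, low in U+DC00..U+DFFF. */
    static uint32_t utf16_combine_surrogates(uint16_t high, uint16_t low)
    {
        return 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                          | (uint32_t)(low - 0xDC00));
    }

UCS-2-minded code never does this combination; it hands the library two invalid "characters", which is presumably where the crashes come from.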

I completely agree that UTF-16 is not appropriate as an exchange format. UTF-8 has the advantage of being resynchronizable, for example, even if it is sometimes bulkier -- and, in any event, if you use compression you'll get roughly the same transmission length for any Unicode format.
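
"Resynchronizable" because UTF-8 continuation bytes are always of the form 10xxxxxx, so after a corrupted or truncated sequence you can simply scan forward to the next lead byte. A quick sketch, with another made-up helper name:

    /* Advance past any continuation bytes to the next possible
     * start of a UTF-8 sequence. */
    static const unsigned char *utf8_resync(const unsigned char *p,
                                            const unsigned char *end)
    {
        while (p < end && (*p & 0xC0) == 0x80)
            p++;
        return p;
    }

UTF-16 has no such property at the byte level: drop a single byte and every following 16-bit unit is read misaligned.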