Subject: Re: question about Unicode
From: Rici Lake <lua@...>
Date: Thu, 7 Dec 2006 17:10:47 -0500
On 7-Dec-06, at 4:56 PM, Glenn Maynard wrote:
> UTF-32 at least does away with the last: a single data element
> always represents a single codepoint. That codepoint may not represent
> the entire glyph, but that's a separate problem -- in UTF-16, you have
> to cope with both decoding codepoints and combining multiple codepoints
> into one glyph, which are different issues causing different problems.
Actually, I think you could solve both of those problems with the same
code. You're not out of the woods using UTF-32, in terms of decoding,
unless you skip validating the codes: with UTF-32 the surrogate codes
are illegal (as are codes >= 2^20 + 2^16, i.e. anything above U+10FFFF).
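
To make that concrete, here is a minimal sketch (not from the original
mail) of the validation a UTF-32 decoder still has to do; the function
name is just illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* A 32-bit code unit is a valid Unicode scalar value only if it is
     * below 0x110000 (i.e. 2^20 + 2^16) and not a surrogate code point. */
    static bool utf32_is_valid(uint32_t c)
    {
        if (c >= 0x110000)               /* beyond U+10FFFF */
            return false;
        if (c >= 0xD800 && c <= 0xDFFF)  /* surrogates are illegal in UTF-32 */
            return false;
        return true;
    }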
> (I suspect that a lot of application-level UTF-16 code simply ignores
> surrogate pairs, turning it into UCS-2, though.)
Yes. When such code is combined with more modern libraries, it can
cause ugly things to happen -- I think that is why some ncurses
installations crash when given characters outside of the BMP.
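
For contrast, a hypothetical sketch of what the non-UCS-2 path has to
do -- combining a lead/trail surrogate pair into one code point instead
of ignoring it (the function name and error convention are my own):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from a UTF-16 buffer.  Returns the number of
     * 16-bit units consumed, or 0 on an unpaired surrogate / empty input. */
    static size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *out)
    {
        if (len == 0)
            return 0;
        if (s[0] < 0xD800 || s[0] > 0xDFFF) {  /* ordinary BMP code unit */
            *out = s[0];
            return 1;
        }
        if (s[0] >= 0xDC00)                    /* trail surrogate with no lead */
            return 0;
        if (len < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
            return 0;                          /* lead not followed by trail */
        *out = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) | (s[1] - 0xDC00));
        return 2;
    }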
I completely agree that UTF-16 is not appropriate as an exchange
format. UTF-8 has the advantage of being resynchronizable, for example,
even if it is sometimes bulkier -- and, in any event, if you use
compression you'll get roughly the same transmission length for any