- Subject: Re: question about Unicode
- From: Glenn Maynard <glenn@...>
- Date: Thu, 7 Dec 2006 18:57:00 -0500
On Thu, Dec 07, 2006 at 05:10:47PM -0500, Rici Lake wrote:
> >UTF-32 at least does away with the last: a single data element
> >(wchar_t) always represents a single codepoint. That codepoint may
> >not represent the entire glyph, but that's a separate problem--in
> >UTF-16, you have to cope with both decoding codepoints, and
> >combining multiple codepoints into one glyph, which are different
> >issues causing different problems.
>
> Actually, I think you could solve both of those problems with the same
> code. You're not out of the woods using UTF-32, in terms of decoding,
> unless you're not validating the codes; with UTF-32 the surrogate codes
> are illegal (as are codes >= 2^20 + 2^16).
You only need to validate on load, and only for data from untrusted
sources, at the point where it enters your application.
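
For what it's worth, that check is cheap in UTF-32 -- something like
the following sketch (made-up name, and using uint32_t rather than
assuming wchar_t is 32 bits): just reject surrogates and anything
past U+10FFFF.

#include <stddef.h>
#include <stdint.h>

/* Sketch only: reject surrogate code units (U+D800..U+DFFF) and
   anything beyond U+10FFFF; every other value is a valid Unicode
   scalar value. */
static int
utf32_valid(const uint32_t *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (s[i] > 0x10FFFF ||
            (s[i] >= 0xD800 && s[i] <= 0xDFFF))
            return 0;
    }
    return 1;
}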
I guess you could handle combining and decoding in the same place, but
I wouldn't--I think of codepoints as the atomic unit of a string, with
glyphs and combining characters layering on top of that. With UTF-32,
you can address a string's codepoints directly: given wchar_t s[],
s[n] is the nth codepoint, and s[n+1] is always the codepoint
immediately following it.
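
Just to illustrate the difference (again only a sketch, with made-up
helper names and fixed-width uint32_t/uint16_t instead of wchar_t):
indexing a UTF-32 string is a plain subscript, while getting the nth
codepoint out of UTF-16 means walking the string and reassembling
surrogate pairs.

#include <stddef.h>
#include <stdint.h>

/* UTF-32: one code unit per codepoint, so lookup is a subscript. */
static uint32_t
utf32_at(const uint32_t *s, size_t n)
{
    return s[n];
}

/* UTF-16: walk the string, stitching surrogate pairs back into
   single codepoints, until we have skipped n of them. */
static uint32_t
utf16_at(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;
    while (i < len) {
        uint32_t c = s[i];
        int pair = c >= 0xD800 && c <= 0xDBFF && i + 1 < len &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (pair)
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
        if (n == 0)
            return c;
        n--;
        i += pair ? 2 : 1;
    }
    return 0xFFFD;  /* out of range; return U+FFFD in this sketch */
}

The UTF-16 lookup is O(n) per call, which is exactly the random-access
cost UTF-32 avoids; combining characters remain a separate layer in
either case.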
--
Glenn Maynard