On Thu, Dec 07, 2006 at 05:10:47PM -0500, Rici Lake wrote:
> >UTF-32 at least does away with the last: a single data element
> >(wchar_t) always represents a single codepoint.  That codepoint may
> >not represent the entire glyph, but that's a separate problem--in
> >UTF-16, you have to cope with both decoding codepoints, and combining
> >multiple codepoints into one glyph, which are different issues causing
> >different problems.
> 
> Actually, I think you could solve both of those problems with the same 
> code. You're not out of the woods using UTF-32, in terms of decoding, 
> unless you're not validating the codes; with UTF-32 the surrogate codes 
> are illegal (as are codes >= 2^20 + 2^16).

You only need to validate on load, and only for data from untrusted
sources, at the point where it enters your application.
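
For what it's worth, the per-codepoint check is tiny.  A rough sketch
(the name valid_utf32 is mine, not from any library), just encoding the
two rules Rici mentions:

    #include <stdint.h>

    /* A UTF-32 code unit is a valid codepoint iff it is not a
       surrogate (U+D800..U+DFFF) and does not exceed U+10FFFF,
       i.e. 2^20 + 2^16 - 1. */
    static int valid_utf32(uint32_t c)
    {
        if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogates are illegal */
        if (c > 0x10FFFF) return 0;                /* beyond the Unicode range */
        return 1;
    }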

I guess you could handle combining and decoding in the same place, but
I wouldn't--I think of codepoints as the atomic unit of a string, with
glyphs and combining characters layered on top of that.  With UTF-32,
you can address a string's codepoints directly: given wchar_t s[], s[n]
is the nth codepoint, and s[n+1] is always the codepoint that follows it.
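
As a purely illustrative example (assuming a 32-bit wchar_t, which holds
on most Unix systems but not on Windows), iterating by codepoint needs
no decode step at all:

    #include <stddef.h>
    #include <wchar.h>

    /* Count how many codepoints in s fall below 'limit'.  s[i] is the
       i-th codepoint, so no decoding loop is needed. */
    static size_t count_below(const wchar_t *s, size_t len, wchar_t limit)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)
            if (s[i] < limit)
                n++;
        return n;
    }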

-- 
Glenn Maynard