- Subject: Re: question about Unicode
- From: Glenn Maynard <glenn@...>
- Date: Thu, 7 Dec 2006 18:57:00 -0500
On Thu, Dec 07, 2006 at 05:10:47PM -0500, Rici Lake wrote:
> >UTF-32 at least does away with the last: a single data element
> >(wchar_t) always represents a single codepoint. That codepoint may
> >not represent the entire glyph, but that's a separate problem--in
> >UTF-16, you have to cope with both decoding codepoints, and
> >combining multiple codepoints into one glyph, which are different
> >issues causing different problems.
>
> Actually, I think you could solve both of those problems with the same
> code. You're not out of the woods using UTF-32, in terms of decoding,
> unless you're not validating the codes; with UTF-32 the surrogate codes
> are illegal (as are codes >= 2^20 + 2^16).
You only need to validate on load, and only for data from untrusted
sources, at the point where it enters your application.
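
For what it's worth, that check is cheap in UTF-32 -- something like
the following sketch (made-up name, and using uint32_t rather than
assuming wchar_t is 32 bits): just reject surrogates and anything
past U+10FFFF.

#include <stddef.h>
#include <stdint.h>

/* Sketch only: reject surrogate code units (U+D800..U+DFFF) and
   anything beyond U+10FFFF; every other value is a valid Unicode
   scalar value. */
static int
utf32_valid(const uint32_t *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (s[i] > 0x10FFFF ||
            (s[i] >= 0xD800 && s[i] <= 0xDFFF))
            return 0;
    }
    return 1;
}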
I guess you could handle combining and decoding in the same place, but
I wouldn't--I think of codepoints as the atomic unit of a string, with
glyphs and combining characters layering on top of that. With UTF-32,
you can address a string's codepoints directly: given wchar_t s[],
s[n] is the nth codepoint, and s[n+1] is always the codepoint
immediately following it.
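
Just to illustrate the difference (again only a sketch, with made-up
helper names and fixed-width uint32_t/uint16_t instead of wchar_t):
indexing a UTF-32 string is a plain subscript, while getting the nth
codepoint out of UTF-16 means walking the string and reassembling
surrogate pairs.

#include <stddef.h>
#include <stdint.h>

/* UTF-32: one code unit per codepoint, so lookup is a subscript. */
static uint32_t
utf32_at(const uint32_t *s, size_t n)
{
    return s[n];
}

/* UTF-16: walk the string, stitching surrogate pairs back into
   single codepoints, until we have skipped n of them. */
static uint32_t
utf16_at(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;
    while (i < len) {
        uint32_t c = s[i];
        int pair = c >= 0xD800 && c <= 0xDBFF && i + 1 < len &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (pair)
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
        if (n == 0)
            return c;
        n--;
        i += pair ? 2 : 1;
    }
    return 0xFFFD;  /* out of range; return U+FFFD in this sketch */
}

The UTF-16 lookup is O(n) per call, which is exactly the random-access
cost UTF-32 avoids; combining characters remain a separate layer in
either case.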
--
Glenn Maynard