On 7-Dec-06, at 9:14 AM, David Given wrote:

> Slightly more seriously, it occurs to me that since composite
> characters mean you can't rely on any individual glyph being
> encoded in a single Unicode code-point, then 32-bit Unicode
> does, in fact, gain you nothing except a false sense of
> security. You always need to write code to cope with
> multicharacter glyphs.

Yes, I agree with that completely. It would have been better
to use native-endian UTF-16 as an internal representation, and
UTF-8 as a transfer encoding, which I believe is what the
Unicode Consortium recommends. UTF-16 uses a maximum of 4 bytes to
represent any code point, but the vast majority of code points
actually used fit into 2 bytes.
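
To put rough numbers on that, here's a minimal sketch in Lua
(utf16_units is just an illustrative helper, not anything in
the standard Lua API; the code points are arbitrary examples):

local function utf16_units(cp)
  -- BMP code points fit in one 16-bit unit; anything above
  -- U+FFFF needs a surrogate pair (two units).
  assert(cp >= 0 and cp <= 0x10FFFF, "not a Unicode code point")
  return (cp <= 0xFFFF) and 1 or 2
end

print(utf16_units(0x0041))  --> 1  (ASCII letter: 2 bytes)
print(utf16_units(0x4E2D))  --> 1  (a CJK ideograph: still 2 bytes)
print(utf16_units(0x10400)) --> 2  (supplementary plane: 4 bytes)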

In fact, UTF-8 also uses a maximum of 4 bytes to represent
any code point, but it needs 3 bytes for most of the code
points used by Asian scripts, so in general terms it is less
compact than UTF-16; in some applications ("mostly ASCII"),
though, it will turn out to be better.
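
And the corresponding UTF-8 byte counts, as another sketch
(utf8_bytes is a made-up helper along the same lines):

local function utf8_bytes(cp)
  assert(cp >= 0 and cp <= 0x10FFFF, "not a Unicode code point")
  if     cp < 0x80    then return 1  -- ASCII
  elseif cp < 0x800   then return 2  -- Latin supplement, Greek, Cyrillic, ...
  elseif cp < 0x10000 then return 3  -- rest of the BMP, including most CJK
  else                     return 4  -- supplementary planes
  end
end

print(utf8_bytes(0x0041))  --> 1  (vs. 2 in UTF-16)
print(utf8_bytes(0x4E2D))  --> 3  (vs. 2 in UTF-16)
print(utf8_bytes(0x10400)) --> 4  (same as UTF-16)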

What you cannot do is naively assume that any encoding is
one-to-one with graphemes.
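
A concrete case: "e" followed by a combining acute accent is one
grapheme, two code points, and three UTF-8 bytes.

local e_acute = "e\204\129"  -- U+0065, U+0301 encoded as UTF-8
print(#e_acute)              --> 3 (bytes, which is all Lua's # sees)
-- Neither the byte count nor the code point count tells you how
-- many glyphs the user perceives; that takes grapheme segmentation.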

Regardless of what the encoding of a Unicode stream
might be, an individual code point needs a data type of
at least 21 bits. Since C's wchar_t mechanism assumes
that a wide string is an array of wide characters, and
a wide character is going to be a 32-bit int on most
platforms, we're probably stuck with a 32-bit internal
representation whether we like it or not.

From the perspective of Lua bindings, though, I wonder
whether it wouldn't be better to just use UTF-8 throughout
and convert to a native representation only when necessary.
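
Something along these lines, say (utf8_to_codepoints is a
hypothetical helper with no validation of malformed input, just
to show the shape of the conversion):

local function utf8_to_codepoints(s)
  -- Decode a UTF-8 string into an array of integer code points,
  -- which is what a 32-bit wchar_t-style native API would want.
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local cp, n
    if     b < 0x80 then cp, n = b, 1
    elseif b < 0xE0 then cp, n = b % 0x20, 2
    elseif b < 0xF0 then cp, n = b % 0x10, 3
    else                 cp, n = b % 0x08, 4
    end
    for j = 1, n - 1 do
      cp = cp * 64 + s:byte(i + j) % 64
    end
    cps[#cps + 1] = cp
    i = i + n
  end
  return cps
end

local cps = utf8_to_codepoints("e\204\129")
print(cps[1], cps[2])  --> 101  769  (U+0065, U+0301)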