On 7-Dec-06, at 9:14 AM, David Given wrote:

> Slightly more seriously, it occurs to me that since composite
> characters mean you can't rely on any individual glyph being
> encoded in a single Unicode code-point, then 32-bit Unicode
> does, in fact, gain you nothing except a false sense of
> security. You always need to write code to cope with
> multicharacter glyphs.

Yes, I agree with that completely. It would have been better
to use native-endian UTF-16 as an internal representation, and
UTF-8 as a transfer encoding, which I believe is what the
Unicode Consortium recommends. UTF-16 uses a maximum of 4 bytes to
represent any code point, but the vast majority of code points
actually used fit into 2 bytes.
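
To put rough numbers on that, here's a minimal sketch in Lua
(utf16_units is just an illustrative helper, not anything in
the standard Lua API; the code points are arbitrary examples):

local function utf16_units(cp)
  -- BMP code points fit in one 16-bit unit; anything above
  -- U+FFFF needs a surrogate pair (two units).
  assert(cp >= 0 and cp <= 0x10FFFF, "not a Unicode code point")
  return (cp <= 0xFFFF) and 1 or 2
end

print(utf16_units(0x0041))  --> 1  (ASCII letter: 2 bytes)
print(utf16_units(0x4E2D))  --> 1  (a CJK ideograph: still 2 bytes)
print(utf16_units(0x10400)) --> 2  (supplementary plane: 4 bytes)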

In fact, UTF-8 also uses a maximum of 4 bytes to represent
any code point, but it needs 3 bytes for most of the code
points used by Asian scripts, so in general terms it is less
compact than UTF-16; in some applications ("mostly ASCII"),
though, it will turn out to be better.
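
And the corresponding UTF-8 byte counts, as another sketch
(utf8_bytes is a made-up helper along the same lines):

local function utf8_bytes(cp)
  assert(cp >= 0 and cp <= 0x10FFFF, "not a Unicode code point")
  if     cp < 0x80    then return 1  -- ASCII
  elseif cp < 0x800   then return 2  -- Latin supplement, Greek, Cyrillic, ...
  elseif cp < 0x10000 then return 3  -- rest of the BMP, including most CJK
  else                     return 4  -- supplementary planes
  end
end

print(utf8_bytes(0x0041))  --> 1  (vs. 2 in UTF-16)
print(utf8_bytes(0x4E2D))  --> 3  (vs. 2 in UTF-16)
print(utf8_bytes(0x10400)) --> 4  (same as UTF-16)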

What you cannot do is naively assume that any encoding is
one-to-one with graphemes.
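
A concrete case: "e" followed by a combining acute accent is one
grapheme, two code points, and three UTF-8 bytes.

local e_acute = "e\204\129"  -- U+0065, U+0301 encoded as UTF-8
print(#e_acute)              --> 3 (bytes, which is all Lua's # sees)
-- Neither the byte count nor the code point count tells you how
-- many glyphs the user perceives; that takes grapheme segmentation.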

Regardless of what the encoding of a Unicode stream
might be, an individual code point needs a data type of
at least 21 bits. Since C's wchar_t mechanism assumes
that a wide string is an array of wide characters, and
a wide character is going to be a 32-bit int on most
platforms, we're probably stuck with a 32-bit internal
representation whether we like it or not.

From the perspective of Lua bindings, though, I wonder
whether it wouldn't be better to just use UTF-8 throughout
and convert to a native representation only when necessary.
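
Something along these lines, say (utf8_to_codepoints is a
hypothetical helper with no validation of malformed input, just
to show the shape of the conversion):

local function utf8_to_codepoints(s)
  -- Decode a UTF-8 string into an array of integer code points,
  -- which is what a 32-bit wchar_t-style native API would want.
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local cp, n
    if     b < 0x80 then cp, n = b, 1
    elseif b < 0xE0 then cp, n = b % 0x20, 2
    elseif b < 0xF0 then cp, n = b % 0x10, 3
    else                 cp, n = b % 0x08, 4
    end
    for j = 1, n - 1 do
      cp = cp * 64 + s:byte(i + j) % 64
    end
    cps[#cps + 1] = cp
    i = i + n
  end
  return cps
end

local cps = utf8_to_codepoints("e\204\129")
print(cps[1], cps[2])  --> 101  769  (U+0065, U+0301)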