Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:

[Unicode]
>It would be nice if we could make easy in Lua to change the "char" type
>to support Unicode. But I think there are many details that are difficult
>to handle only through macros.

Please don't go this route.  It makes storing and exchanging data hell.

Both Tcl and Perl address this issue by using UTF-8 to represent
Unicode data.  UTF-8 maps 1:1 onto 7-bit ASCII, and uses byte values
128-255 to build multi-byte sequences for everything else.  The beauty
of it is that a lot of existing code keeps on working as is (even Lua's
lexer would, I expect).  The main trade-off is that indexing by Unicode
character becomes less straightforward, and that the "length" of a
string, counted in Unicode chars, is no longer the same as the length
of its byte representation.
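
To make that trade-off concrete, here is a rough sketch in Python (used
only for brevity; utf8_len is a made-up helper name, not an existing
library function).  It assumes valid UTF-8 input and counts characters by
skipping continuation bytes, which all have the bit pattern 10xxxxxx:

```python
def utf8_len(data: bytes) -> int:
    """Count Unicode characters in valid UTF-8 data.

    Continuation bytes look like 0b10xxxxxx; every other byte
    (plain ASCII or a lead byte) starts a new character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


ascii_text = "Lua".encode("utf-8")
mixed_text = "na\u00efve \u20ac".encode("utf-8")  # "naive" with i-diaeresis, then a euro sign

# Pure ASCII is byte-for-byte unchanged: one byte per character.
assert ascii_text == b"Lua"
assert utf8_len(ascii_text) == len(ascii_text)

# Outside ASCII, byte length and character count diverge.
byte_length = len(mixed_text)       # 10 bytes
char_count = utf8_len(mixed_text)   # 7 characters
```

Counting is a linear scan rather than a constant-time lookup, which is
exactly the cost mentioned above.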

A few more properties of UTF-8:
 - zero-byte delimiters continue to work (one minor issue: U+0000 itself
   still encodes as a zero byte)
 - can be exchanged as strings, even with non-Unicode-aware machines
 - no endianness issues: UTF-8 is basically a string of bytes
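
The zero-byte and endianness points can be checked with a few lines of
Python (again just a sketch for illustration; the sample text and the
variable names are arbitrary):

```python
text = "h\u00e9llo \u4e16\u754c"  # Latin text with an accent, plus two CJK characters

utf8 = text.encode("utf-8")

# Multi-byte sequences only use bytes >= 0x80, so no zero byte appears
# unless the text itself contains U+0000.
has_nul = 0 in utf8                       # False for this text
nul_char = "\u0000".encode("utf-8")       # the one exception: a single zero byte

# A 2-byte encoding, by contrast, is full of zero bytes for ASCII text
# and comes in two byte orders.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf16_has_nul = 0 in utf16_le             # True
byte_order_matters = utf16_le != utf16_be  # True
```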

Python decided to go for a 2-byte internal representation instead, BTW.

Lua could use UTF-8, since it does not have "str[i]" type indexing and is
8-bit clean.  Evidently, the str* functions are affected, but these are
outside the core, and therefore neatly replaceable.  The basic idea would
be to cover all the places where strings come in or go out, and to keep
the Lua core mostly as is.

Having said all this, I must add that I know enough about Unicode to know
that I know hardly anything (go ahead, read that sentence again).  It's
tricky stuff, and it is all too easy to overlook implications
(capitalization, word delimiting, ...).  Anyone considering dealing with
this had better make sure they have an expert at hand.

-jcw