> The way both Tcl and Perl address this issue, is to use UTF-8 to
> represent Unicode data.  UTF-8 maps 1:1 on 7-bit ASCII, and uses the
> upper 128 chars to create a multi-byte encoding.  The beauty of it is
> that a lot of existing code keeps on working as is (even Lua's lexer
> would, I expect), the main trade-off is that character-wise (Unicode,
> that is) indexing becomes less straightforward, and that the "length" of
> a string, in terms of counting Unicode chars, also is no longer
> equivalent to the length of the byte-representation.

I'd like to say that I've used UTF-8 encoding, and found it extremely easy
and straightforward to use. As Jean-Claude Wippler says, most string.h
routines will work unchanged. I used to consider this encoding a big hack
for those who can't switch to 16-bit Unicode or other encodings, but it is
actually a very viable solution (it also means that 'normal' strings don't
suddenly take up twice the storage space, as they would with a 16-bit
encoding). Also, I think Lua's hashing would take a performance hit from
having to use 16-bit characters.
Moreover, the character indexing problem can be neatly hidden behind a bunch
of convenience functions that increment a pointer along the string.
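
As a rough sketch of what such convenience functions might look like in C:
the names utf8_next, utf8_length and utf8_index are made up for illustration,
and the code assumes valid, NUL-terminated UTF-8 (no error checking is done).
It relies on the fact that every continuation byte of a multi-byte sequence
has the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Hypothetical helper: step past one UTF-8 encoded character.
   Assumes s points at the first byte of a character in valid UTF-8.
   Returns s unchanged when already at the terminating NUL. */
static const char *utf8_next(const char *s)
{
    if (*s == '\0')
        return s;
    s++;                                        /* skip the lead byte */
    while (((unsigned char)*s & 0xC0) == 0x80)  /* skip continuation bytes */
        s++;
    return s;
}

/* Hypothetical helper: number of characters (code points), not bytes. */
static size_t utf8_length(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        s = utf8_next(s);
        n++;
    }
    return n;
}

/* Hypothetical helper: pointer to the i-th character (0-based),
   or to the terminating NUL if the string is shorter than that. */
static const char *utf8_index(const char *s, size_t i)
{
    while (i > 0 && *s != '\0') {
        s = utf8_next(s);
        i--;
    }
    return s;
}
```

With these, strlen() still gives the byte length while utf8_length() gives
the character count, and utf8_index() stands in for s[i]. Note that indexing
this way is linear in i rather than constant time, which is the trade-off
mentioned in the quoted message.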

--
Vincent Penquerc'h