- Subject: Re: Lua 5.1 and UTF-8 ?
- From: Rici Lake <lua@...>
- Date: Sun, 22 May 2005 12:49:46 -0500
On 22-May-05, at 11:55 AM, Asko Kauppi wrote:
> I've been thinking about UTF-8 and Lua lately, and wonder how much
> work it would be to actually support that in Lua "out of the box".
> There are some programming languages (s.a. Tcl) that claim already to
> do that, and I feel the concept would match Lua's targets and
> philosophy rather nicely.
I guess that depends on what you mean by "support". Lua currently does
not interfere with UTF-8, but it lacks:
1) UTF-8 validation
2) A mechanism for declaring the encoding of a source file
3) An escape mechanism for including UTF-8 in string literals
(currently this is only possible either by using a UTF-8 aware text
editor, or by manually working out the decimal \ escapes for a
particular character)
4) Multicharacter aware string patterns with Unicode character classes
5) Various utilities, including single character-code conversion, code
counting, normalization, etc.
A number of people have made attempts to implement some or all of
these features, and standard libraries exist for them (but they are
"bulky").
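As a rough illustration of what item (1) involves, a structural
validator needs nothing beyond string.byte. This is only a sketch
(utf8_valid is an invented name): it checks lead/continuation byte
patterns but does not reject overlong forms or surrogate code points.

```lua
-- Minimal structural UTF-8 validator for Lua 5.1.
local function utf8_valid(s)
  local i, n = 1, #s
  while i <= n do
    local b, len = s:byte(i)
    if b < 0x80 then len = 1
    elseif b >= 0xC2 and b <= 0xDF then len = 2
    elseif b >= 0xE0 and b <= 0xEF then len = 3
    elseif b >= 0xF0 and b <= 0xF4 then len = 4
    else return false end            -- invalid lead byte
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)            -- each trailing byte must be 0x80-0xBF
      if not c or c < 0x80 or c > 0xBF then return false end
    end
    i = i + len
  end
  return true
end

print(utf8_valid("plain ascii"))  -- true
print(utf8_valid("\255\254"))     -- false
```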
> I understand UTF-8 might not be everyone's favourite, but it is mine.
> :) And having a working framework (str:bytes(), str:length(),
> str:width()) could easily be adapted to other extended encoding
> schemes as well.
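A str:length()-style code-point count, at least, is nearly free in
stock Lua 5.1: UTF-8 continuation bytes all lie in the range 0x80-0xBF,
so counting the bytes outside that range counts code points (utf8_len
is an invented name here, not an existing function):

```lua
-- Count UTF-8 code points by counting every byte that is NOT a
-- continuation byte (continuation bytes are \128-\191, i.e. 0x80-0xBF).
local function utf8_len(s)
  local _, count = s:gsub("[^\128-\191]", "")
  return count
end

print(utf8_len("abc"))       -- 3
print(utf8_len("\195\169"))  -- 1 (one two-byte character)
```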
There are arguments and counter-arguments for all of the standard
Unicode Transformation Formats. UTF-8 is fairly easy to work with if
the majority of the work is simply moving strings between components;
it is less ideal for text processing, for which UTF-16 is generally
better. (There are also arguments and counter-arguments about using a
32-bit internal representation; the 16-bit representation is still
variable-width because of surrogate pairs, but since graphemes are
often represented as multiple character codes, display-oriented text
processing is going to have to deal with variable-length grapheme
sequences regardless of the base encoding.)
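To make the surrogate-pair point concrete, this is the standard UTF-16
split of a code point above U+FFFF (surrogate_pair is just an
illustrative name):

```lua
-- Split a supplementary code point into its UTF-16 surrogate pair:
-- subtract 0x10000, then take the high/low 10 bits.
local function surrogate_pair(cp)
  local v = cp - 0x10000
  local high = 0xD800 + math.floor(v / 0x400)
  local low  = 0xDC00 + v % 0x400
  return high, low
end

-- U+1D11E (musical symbol G clef) encodes as the pair D834 DD1E.
print(string.format("%X %X", surrogate_pair(0x1D11E)))  -- D834 DD1E
```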
> The reason I'm bringing this up right now, is that the issue could
> suit nicely with the 5.1 "every type has a metatable" way of thinking;
> would it warrant an opportunity to have a closer look at what Lua
> means by 'strings' (or rather, their encoding) anyhow?
I'm pretty firmly of the belief that keeping strings as octet-sequences
is really a simplification. It is not uncommon to have a mixture of
character encodings in a single application, so assigning a metatable
to the string type will often prove unsatisfactory. I'm not really sure
what the solution is, but I have been bitten more than once by
programming languages such as Perl and Python which have glued
character encoding onto their basic string types. (In Python, for
example, a UTF-8 sequence is *not* of type Unicode, which can be
seriously awkward.)
If strings are simply octet-sequences, it becomes the programmer's
responsibility to identify (or remember) the encoding for each string;
that can also be awkward but it has the advantage of being clear.
For the record, there are some hidden subtleties, particularly in the
area of normalization. Unicode does not really specify a canonical
normalization, but the clear intent is that the two non-compatibility
normalization forms (NFC and NFD) define a canonical equality
comparison.
Unfortunately, this would have a significant impact on the use of
Unicode strings as table keys (which is, indeed, visible in both Perl
and Python). UTF-8 at least has the virtue that any string which only
contains codes 0-127 (decimal) is identical between UTF-8 and
ISO-8859-x, and furthermore that all normalization forms are the
identity function for such strings.
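That last property is easy to exploit: a pure-ASCII check is a single
pattern match, and such strings can skip normalization entirely
(is_ascii is an invented name for this sketch):

```lua
-- Bytes 0-127 only means the string is simultaneously valid ASCII,
-- UTF-8 and ISO-8859-x, and every Unicode normalization form leaves
-- it unchanged; detecting that common case is one pattern match.
local function is_ascii(s)
  return s:find("[\128-\255]") == nil
end

print(is_ascii("hello"))      -- true
print(is_ascii("\195\169"))   -- false (non-ASCII byte present)
```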