Re: Lua 5.1 and UTF-8 ?

lua-l archive


Subject: Re: Lua 5.1 and UTF-8 ?
From: Rici Lake <lua@...>
Date: Sun, 22 May 2005 12:49:46 -0500


On 22-May-05, at 11:55 AM, Asko Kauppi wrote:

I've been thinking about UTF-8 and Lua lately, and wonder how much work it would be to actually support that in Lua "out of the box". There are some programming languages (such as Tcl) that already claim to do that, and I feel the concept would match Lua's targets and philosophy rather nicely.

I guess that depends on what you mean by "support". Lua currently does not interfere with UTF-8, but it lacks:


1) UTF-8 validation

2) A mechanism for declaring the encoding of a source file

3) An escape mechanism for including UTF-8 in string literals (at present the only options are using a UTF-8 aware text editor or manually working out the decimal \ escapes for a particular character)

4) Multicharacter-aware string patterns with Unicode character classes

5) Various utilities, including single character-code conversion, code counting, normalization, etc.
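
To illustrate item 3, here is a minimal sketch of what "manually working out the decimal \ escapes" amounts to. The helper names utf8_encode and decimal_escapes are hypothetical, not part of Lua or of any library mentioned in this thread, and the code uses only arithmetic so that it should run on Lua 5.1.

```lua
-- Hypothetical helper: encode one Unicode code point (0..0x10FFFF)
-- as a UTF-8 byte string, using arithmetic instead of bit operations.
local function utf8_encode(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)
  elseif cp < 0x800 then
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

-- Hypothetical helper: render every byte of s as a decimal \ddd escape,
-- i.e. the text one would paste into a Lua string literal.
local function decimal_escapes(s)
  return (s:gsub(".", function(c)
    return string.format("\\%03d", c:byte())
  end))
end

-- U+20AC (EURO SIGN) is the three octets 226, 130, 172 in UTF-8.
print(decimal_escapes(utf8_encode(0x20AC)))
```

The printed text can then be used as a literal, e.g. "\226\130\172", without needing a UTF-8 aware editor.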

Various people have made various attempts to implement some or all of these features; standard libraries exist for them (but they are "bulky").
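
As a rough idea of how small the validation piece (item 1) can be when done in plain Lua, here is a sketch under stated assumptions: utf8_valid is a hypothetical name, the function rejects overlong forms, surrogates and code points above U+10FFFF, and it is not taken from any of the libraries alluded to above.

```lua
-- Hypothetical helper: return true if s is well-formed UTF-8.
local function utf8_valid(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, cp, min
    if b < 0x80 then
      len, cp, min = 1, b, 0
    elseif b >= 0xC0 and b < 0xE0 then
      len, cp, min = 2, b - 0xC0, 0x80
    elseif b >= 0xE0 and b < 0xF0 then
      len, cp, min = 3, b - 0xE0, 0x800
    elseif b >= 0xF0 and b < 0xF8 then
      len, cp, min = 4, b - 0xF0, 0x10000
    else
      return false              -- stray continuation or invalid lead byte
    end
    if i + len - 1 > n then
      return false              -- sequence truncated at end of string
    end
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)
      if c < 0x80 or c > 0xBF then
        return false            -- not a continuation byte
      end
      cp = cp * 0x40 + (c - 0x80)
    end
    -- Reject overlong encodings, surrogates, and out-of-range values.
    if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
      return false
    end
    i = i + len
  end
  return true
end
```

A caller would typically run this once at the boundary where a string enters the program, and treat the string as opaque octets afterwards.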

I understand UTF-8 might not be everyone's favourite, but it is mine. :) And having a working framework (str:bytes(), str:length(), str:width()) could easily be adapted to other extended encoding schemes as well.

There are arguments and counter-arguments for all of the standard Unicode Transformation Formats. UTF-8 is fairly easy to work with if the majority of the work is simply moving strings between components; it is less ideal for text processing, for which UTF-16 is generally better. (There are arguments and counter-arguments about using a 32-bit internal representation; the 16-bit representation is still variable width because of surrogate pairs, but the fact that graphemes are often represented as multiple character codes means that display-oriented text processing is going to have to be able to deal with variable length grapheme codes regardless of base encoding.)
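
The remark about surrogate pairs can be made concrete with a small sketch. The function name utf16_units is hypothetical; it only shows the arithmetic by which a code point above U+FFFF becomes two 16-bit code units, which is why UTF-16 is still a variable-width encoding.

```lua
-- Hypothetical helper: return the UTF-16 code units for one code point
-- as a table of numbers (one unit, or a surrogate pair).
local function utf16_units(cp)
  if cp < 0x10000 then
    return { cp }                         -- fits in a single 16-bit unit
  end
  local v = cp - 0x10000                  -- 20-bit value to split
  return {
    0xD800 + math.floor(v / 0x400),       -- high (lead) surrogate
    0xDC00 + v % 0x400,                   -- low (trail) surrogate
  }
end

local units = utf16_units(0x1F600)
print(#units, string.format("%04X %04X", units[1], units[2]))
```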

The reason I'm bringing this up right now is that the issue could fit nicely with the 5.1 "every type has a metatable" way of thinking; would it warrant an opportunity to have a closer look at what Lua means by 'strings' (or rather, their encoding) anyhow?

I'm pretty firmly of the belief that keeping strings as octet-sequences is really a simplification. It is not uncommon to have a mixture of character encodings in a single application, so assigning a metatable to the string type will often prove unsatisfactory. I'm not really sure what the solution is, but I have been bitten more than once by programming languages such as Perl and Python which have glued character encoding on to their basic string types. (In Python, for example, a UTF-8 sequence is *not* of type Unicode, which can be seriously awkward.)

If strings are simply octet-sequences, it becomes the programmer's responsibility to identify (or remember) the encoding for each string; that can also be awkward but it has the advantage of being clear.
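
A small sketch of why a single string metatable sits badly with mixed encodings. It assumes Lua 5.1 behaviour, where all strings share one metatable whose __index is the string library; utf8_len and the method name ulen are hypothetical, chosen only for this illustration.

```lua
-- Count UTF-8 code points by counting every byte that is not a
-- continuation byte (0x80-0xBF). Assumes s really is valid UTF-8.
local function utf8_len(s)
  local _, count = s:gsub("[^\128-\191]", "")
  return count
end

-- Installing the function in the string library makes it a method of
-- every string, whatever encoding that string happens to hold.
string.ulen = utf8_len          -- hypothetical method name

local pound_utf8   = "\194\163" -- U+00A3 encoded as UTF-8 (2 octets)
local pound_latin1 = "\163"     -- U+00A3 encoded as ISO-8859-1 (1 octet)

print(#pound_utf8, pound_utf8:ulen())     -- 2 octets, 1 code point
print(#pound_latin1, pound_latin1:ulen()) -- 1 octet, but ulen reports 0
```

The method cannot know that the second string is not UTF-8, so it quietly returns a wrong answer. Calling utf8_len explicitly, only on strings known to be UTF-8, keeps that responsibility visible in the code.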

For the record, there are some hidden subtleties, particularly in the area of normalization. Unicode does not really specify a canonical normalization, but it is clear that the intent is that the two non-compatibility forms (NFC and NFD) do define canonical equality comparison. Unfortunately, this would have a significant impact on the use of Unicode strings as table keys (which is, indeed, visible in both Perl and Python). UTF-8 at least has the virtue that any string which only contains codes 0-127 (decimal) is identical between UTF-8 and ISO-8859-x, and furthermore that all normalization forms are the identity function for such strings.
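
The table-key problem can be seen with plain octet strings. The sketch below assumes nothing beyond standard Lua; is_ascii is a hypothetical helper for detecting the safe 0-127 case described above.

```lua
-- Two canonically equivalent spellings of "e with acute accent":
local composed   = "\195\169"   -- U+00E9 (precomposed form)
local decomposed = "e\204\129"  -- U+0065 followed by U+0301 (combining acute)

local t = {}
t[composed] = "found"

-- Lua compares strings octet by octet, so the decomposed spelling
-- is a different key even though it denotes the same text.
print(t[composed], t[decomposed])

-- Hypothetical helper: true if s contains only codes 0-127, in which
-- case encoding and normalization questions do not arise.
local function is_ascii(s)
  return not s:find("[\128-\255]")
end

print(is_ascii("hello"), is_ascii(composed))
```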

Follow-Ups:
- Re: Lua 5.1 and UTF-8 ?, Asko Kauppi
- Re: Lua 5.1 and UTF-8 ?, Klaus Ripke

References:
- Lua 5.1 and UTF-8 ?, Asko Kauppi
