
On Sun, May 22, 2005 at 12:49:46PM -0500, Rici Lake wrote:
> On 22-May-05, at 11:55 AM, Asko Kauppi wrote:
> >
> >I've been thinking about UTF-8 and Lua lately, and wonder how much 
> >work it would be to actually support that in Lua "out of the box".  
"out of the box" is a matter of distros,
and I'd be happy if LuaX would ship with the Selene UTF-8 implementation.
However, I would strongly discourage treating every "string"
as a sequence of UTF-8 characters by default.
It is bound to break a lot of things,
and Tcl's UTF-8 handling is a sad showcase for this
(you then end up needing a distinct "binary" type,
and everything gets awfully complicated ...).

> >There are some programming languages (such as Tcl) that claim already to 
> >do that, and I feel the concept would match Lua's targets and 
> >philosophy rather nicely.
> I guess that depends on what you mean by "support". Lua currently does 
> not interfere with UTF-8, but it lacks:
> 1) UTF-8 validation
> 2) A mechanism for declaring encoding of a source file
> 3) An escape mechanism for including UTF-8 in string literals (so that 
> it is only possible by either using a UTF-8 aware text editor, or 
> manually working out the decimal \ escapes for a particular character)
> 4) Multicharacter aware string patterns with Unicode character classes
> 5) Various utilities, including single character-code conversion, code 
> counting, normalization, etc.
> Various people have made various attempts to implement some or all of 
> these features; standard libraries exist for them
Selene UTF-8 has 1, 4 and 5.
I do not see a need for 3 or 2.

at 3: you need no escape mechanism at all for string literals,
as UTF-8 encoding does not use any special characters such as quotes.
Just write your sources with a UTF-8 editor and everything is fine.
After all, the Lua compiler is consistently 8-bit clean everywhere.
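To illustrate (a minimal sketch, assuming the source file is itself saved
as UTF-8; the decimal escapes spell out the very same bytes by hand):

```lua
-- Lua string literals are 8-bit clean, so UTF-8 text can be typed
-- directly; the decimal \ escapes below produce the identical bytes.
local direct  = "grüße"                 -- written with a UTF-8 editor
local escaped = "gr\195\188\195\159e"   -- ü = 195 188, ß = 195 159
assert(direct == escaped)
assert(#direct == 7)                    -- 5 characters, 7 bytes
```

If both asserts hold, the compiler never had to know the encoding;
the bytes simply pass through.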

at 2: Why? If you really want to edit your sources
in KOI but want your literals to magically be encoded in
whatever, then feed them through any standard conversion tool.
That would only be an issue if "the system" somehow
forcefully and sadly treated every string as UTF-8,
but even then it's a matter of code management.
And it's not an issue of UTF-8 at all; you might as well
require a Cp850-to-ISO-Latin-1 translator.
The author of Hamster or of package.loaders might consider including
such a feature, but the Lua compiler should surely not have to know
anything about encodings, no? We might end up with an XML-style mess.

> (but they are "bulky").
humm, well, any suggestions/patches on how to further reduce size
would be *highly* appreciated!
What do you think could be dropped from Selene UTF-8?
After all, the overall size is mostly due to the character table,
and that is already the smallest Unicode table I've ever seen
(from Tcl).

> >I understand UTF-8 might not be everyone's favourite, but it is mine. 
So include it with LuaX.

> >:) And having a working framework (str:bytes(), str:length(), 
> >str:width()) could easily be adopted to other extended encoding 
> >schemes as well.
Methinks that the extent to which we already provide variations on
the theme is quite a stretch, but at least one that can be
implemented efficiently.

> There are arguments and counter-arguments for all of the standard 
> Unicode Transformation Formats.
However, UTF-8 is the one everyone has to support externally, de facto,
for a couple of reasons. So using other internal formats is a matter
of additional optimization (if at all). Supporting UCS-2 as yet
another personality would be straightforward but limited, UCS-4
straightforward but expensive, UTF-16 straightforward but ugly;
and grapheme support (which we have) renders the
"one code point, one character" advantage of all of them void.
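The byte-count trade-offs can be made concrete with a hand-rolled
encoder (a sketch for the BMP only; `to_utf8` is a hypothetical helper,
not Selene's API):

```lua
-- Encode a single code point into UTF-8 bytes (BMP range only, for
-- brevity); compare the variable byte counts with the fixed 2 bytes
-- of UCS-2 and 4 bytes of UCS-4.
local function to_utf8(cp)
  if cp < 0x80 then                      -- ASCII: 1 byte
    return string.char(cp)
  elseif cp < 0x800 then                 -- 2-byte sequence
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  else                                   -- 3-byte sequence (cp < 0x10000)
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

assert(to_utf8(0x24)   == "\36")            -- '$': 1 byte
assert(to_utf8(0xFC)   == "\195\188")       -- 'ü': 2 bytes
assert(to_utf8(0x20AC) == "\226\130\172")   -- '€': 3 bytes
```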

As long as *the* Lua string can hold whatever,
it's all a matter of libraries.
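That claim is easy to demonstrate: with plain octet strings, `#` counts
bytes, and even a code-point count needs no new string type (`ulen`
below is a hypothetical helper, not part of any library mentioned here):

```lua
-- Count code points by counting every byte that is NOT a UTF-8
-- continuation byte (the range 128-191); # still counts raw octets.
local function ulen(s)
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

local s = "h\195\169llo"   -- "héllo": é is the two bytes 195 169
assert(#s == 6)            -- byte length
assert(ulen(s) == 5)       -- code-point length
```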

> I'm pretty firmly of the belief that keeping strings as octet-sequences 
> is really a simplification.
couldn't agree more

> It is not uncommon to have a mixture of 
> character encodings in a single application, so assigning a metatable 
> to the string type will often prove unsatisfactory. I'm not really sure 
> what the solution is, but I have been bitten more than once by 
> programming languages such as Perl and Python which have glued 
> character encoding on to their basic string types.
... me too.
And far more often than by mistaking one encoding for another!
You have to control your external I/O anyway (e.g. for Cp850 vs. ISO).

> For the record, there are some hidden subtleties, particularly in the 
> area of normalization. Unicode does not really specify a canonical 
> normalization,
They specify a couple, but they do not matter that much in practice.
E.g. in Germany, with our beloved umlauts, it would be next to
suicidal for any application to deliver them in decomposed
normal form by default, and this probably holds at least for
Latin-1 land. Where you really have to deal with data integration
from heterogeneous sources, you're in much bigger trouble anyway.
And as normalization involves complex and expensive algorithms, and
is a moving target, you have to have the choice of which
normalize() you call where.
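For the record, the umlaut case in bytes (a sketch; both forms render
identically on screen yet compare unequal as octet strings):

```lua
-- ä precomposed (NFC, U+00E4) vs. decomposed (NFD, U+0061 + U+0308):
local nfc = "\195\164"     -- ä as a single code point: bytes 195 164
local nfd = "a\204\136"    -- 'a' + COMBINING DIAERESIS: bytes 204 136
assert(nfc ~= nfd)         -- equal to the eye, unequal as octets
assert(#nfc == 2 and #nfd == 3)
```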