- Subject: Re: Internationalisation in programming languages [Was Re: lex patch]
- From: RLake@...
- Date: Fri, 5 Apr 2002 18:04:35 -0500
Björn De Meyer wrote:
> What I find strange in all this talk about internationalizing the
> variable names, nobody talks about internationalizing the lua
> strings themselves. I would like lua to be compatible with UTF-8,
> for instance... But I doubt this is already the case.
That depends on what you mean by "compatible with UTF-8".
A Lua string is a linear vector of whatever the compiler says a char is
(unless you change that in the Lua config, in which case you might be able
to change char to something else). So a string has to be linearly
addressable, and that's about it. It is not restricted by alphabet, so a
string can contain embedded 0s or any other value.
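To make the "linearly addressable, embedded 0's allowed" point concrete, here is a minimal C sketch of a length-counted byte string. The ByteString type and count_zero_bytes function are hypothetical names invented for illustration; this is not Lua's internal string structure, only the general idea of carrying an explicit length instead of relying on a terminating zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration type: a pointer plus an explicit length.
   This is NOT Lua's actual internal representation. */
typedef struct {
    const char *data;
    size_t len;
} ByteString;

/* Counts zero bytes inside the string. Because the length is stored
   explicitly, a zero byte is ordinary data and does not end the string. */
static size_t count_zero_bytes(ByteString s) {
    assert(s.data != NULL);
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++) {
        if (s.data[i] == '\0') {
            n++;
        }
    }
    return n;
}
```

For the five bytes 'a', 0, 'b', 0, 'c' with len set to 5, count_zero_bytes reports 2, whereas strlen on the same buffer stops at the first zero and reports 1.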
The Lua standard string library uses locale-dependent C library calls to
decide what is a letter, whitespace, etc., and relies on the underlying C
compiler to decide what a \n is.
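As a rough illustration of what locale-dependent classification means for UTF-8 data, the C sketch below applies isalpha to each byte of a buffer. The function name count_alpha_bytes is made up for this example, and it only mirrors the kind of C library call described above; it is not code from the Lua string library.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Counts the bytes of s[0..len) that isalpha() accepts in the
   current locale. Hypothetical helper, for illustration only. */
static size_t count_alpha_bytes(const char *s, size_t len) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        /* Cast to unsigned char: passing a negative value other than
           EOF to isalpha is undefined behaviour. */
        if (isalpha((unsigned char)s[i])) {
            n++;
        }
    }
    return n;
}
```

In the "C" locale, the six-byte UTF-8 encoding of "Björn" ("Bj\xC3\xB6rn") yields 4: the two bytes that encode ö are not classified as letters, so byte-wise classification sees "Bj" and "rn" as separate runs.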
So it will support UTF-8 in the sense that any UTF-8 sequence is a valid
string. Since UTF-8 maps low ASCII onto low ASCII (code points below 128
are encoded as the same single bytes), it will also support UTF-8 in
search strings in the C locale.
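The reason byte-oriented searching is safe on UTF-8 can be sketched in C. This example substitutes plain strstr for Lua's pattern matching, so it shows only the underlying byte-level property; find_bytes is a hypothetical helper name. Every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII needle can never match in the middle of a non-ASCII character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the byte offset of needle in haystack, or -1 if it does not
   occur. Both arguments are zero-terminated byte strings; no decoding
   is performed. Hypothetical helper, for illustration only. */
static long find_bytes(const char *haystack, const char *needle) {
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}
```

With the haystack "Bj\xC3\xB6rn De Meyer", searching for "De" returns byte offset 7 and searching for the two-byte encoding of ö ("\xC3\xB6") returns 2. Searching "Bj\xC3\xB6rn" for a plain "o" returns -1, because ö is not stored as an ASCII o plus something else. Note that the offsets are byte offsets, not character offsets.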
But that's about it. Of course, there is nothing stopping you from writing
library routines to do whatever UTF-8 manipulation you want to do.
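As one small example of such a library routine, the C sketch below counts the code points in a UTF-8 buffer by counting every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). The name utf8_length is hypothetical, and the function assumes its input is already valid UTF-8; it makes no attempt to detect malformed sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in s[0..len), ASSUMING the bytes are valid UTF-8.
   Each code point has exactly one lead byte, and only continuation
   bytes match 10xxxxxx, so counting the non-continuation bytes counts
   the code points. Hypothetical helper, for illustration only. */
static size_t utf8_length(const char *s, size_t len) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For the six bytes of "Bj\xC3\xB6rn" this returns 5, while the byte length stays 6. A routine like this sits on top of the plain byte string and leaves the string type itself untouched.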
Should it do more? I don't think so, because:
-- it is useful to have a datatype which does not get parsed all the time
-- it is not at all clear that UTF-8 is a better internal storage encoding
than, say, UTF-16 or UTF-32
-- there is no obvious way to handle "invalid string encodings" in the
I would advocate building Unicode libraries on top of the string datatype
rather than building them into the string datatype. I don't actually oppose
Unicode but I think that it is not well-thought-out from a parsing
perspective, and as mentioned earlier, I have many concerns about mindless