lua-users home
lua-l archive


Björn De Meyer wrote:

> What I find strange in all this talk about internationalizing the
> variable names, nobody talks about internationalizing the lua
> strings themselves. I would like lua to be compatible with UTF-8,
> for instance... But I doubt this is already the case.

That depends on what you mean by "compatible with UTF-8".

A Lua string is a linear vector of whatever the compiler says a char is
(unless you change that in the Lua config, in which case you might be able
to change char to something else). So a string has to be linearly
addressable, and that's about it. It is not restricted by alphabet, so a
string can contain embedded zeros or any other value.
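
As a small illustration of the point above (a sketch only, assuming the usual
8-bit char and a Lua version that has the # length operator), a string can hold
a zero byte or the bytes of a UTF-8 sequence, and Lua just counts chars:

```lua
-- A string is a vector of chars; nothing in it is interpreted.
local s = "a\0b"               -- embedded zero byte
print(#s)                      -- 3

-- "\195\169" is the two-byte UTF-8 encoding of U+00E9 (e with acute).
local e = "\195\169"
print(#e)                      -- 2: the length is in chars, not in characters
print(string.byte(e, 1))       -- 195
print(string.byte(e, 2))       -- 169
```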

The Lua standard string library uses locale-dependent library calls to
decide what is a letter, whitespace, etc., and relies on the underlying
compiler to decide what a \n is.

So it will support UTF-8 in the sense that any UTF-8 sequence is a valid
string. Since UTF-8 maps low ASCII onto low ASCII, it will also support
UTF-8 in search strings in the C locale.
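
For example, searching for a UTF-8 needle inside UTF-8 text works as an
ordinary search over chars. This is only a sketch: the sample text is mine, it
is written with decimal escapes so that it does not depend on the encoding of
the source file, and the positions returned are char offsets, not character
offsets.

```lua
-- "naive cafe" with i-diaeresis and e-acute, spelled out as UTF-8 bytes.
local text   = "na\195\175ve caf\195\169"
local needle = "caf\195\169"

-- The fourth argument (true) requests a plain substring search, so no
-- character of the needle is treated as pattern magic.
local first, last = string.find(text, needle, 1, true)
print(first, last)             -- 8 and 12: offsets counted in chars
```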

But that's about it. Of course, there is nothing stopping you from writing
library routines to do whatever UTF-8 manipulation you want to do.
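
As a rough sketch of what such a routine might look like, here is a function
that counts the characters in a UTF-8 string. The name utf8_len is made up for
this example, and the function assumes its input is well-formed UTF-8; it does
no validation at all.

```lua
-- Hypothetical helper, not part of any standard library.
-- Counts UTF-8 characters by counting every byte that is not a
-- continuation byte (continuation bytes lie in the range 128..191).
-- Assumes well-formed input: invalid sequences are not detected.
local function utf8_len(s)
  local count = 0
  for i = 1, #s do
    local b = string.byte(s, i)
    if b < 128 or b >= 192 then
      count = count + 1
    end
  end
  return count
end

local word = "caf\195\169"     -- four characters, five chars
print(#word)                   -- 5
print(utf8_len(word))          -- 4
```

Comparing byte values numerically avoids any need for bit operations, so the
same code should work unchanged whatever the locale is set to.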

Should it do more? I don't think so, because:

-- it is useful to have a datatype which does not get parsed all the time
-- it is not at all clear that UTF-8 is a better internal storage encoding
than, say, UTF-16 or UTF-32
-- there is no obvious way to handle "invalid string encodings" in the

I would advocate building Unicode libraries on top of the string datatype
rather than building them into the string datatype. I don't actually oppose
Unicode, but I think that it is not well thought out from a parsing
perspective, and as mentioned earlier, I have many concerns about mindless