On Feb 18, 2005, at 16:16, David Given wrote:

Basically, Lua doesn't know about encodings. Lua strings are streams of
bytes, and it assumes that one character is one byte. Collation is done
using the byte value.


This means that you can put any kind of data in a string --- but it's your
responsibility to manipulate it correctly and do any conversion.

Ah... this is the major catch.

For example, if you're storing UTF8 in a Lua string (which is the recommended
way of doing Unicode in Lua), then you can't assume that you can read
character n by looking at byte n. *However*, string substitutions and pattern
matching will still work in a limited way. The pattern ".*fnord.*" will still
match any string containing 'fnord', regardless of whether there are
multibyte characters in the string; likewise, the pattern ".*©.*" will work;
but "©*" won't work, because the * will bind only to the last byte of the
multibyte character. The collation functions will still work on single-byte
characters but will sort multibyte characters oddly. And so on.
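
[Editorial sketch, not part of the original message.] The behaviour described above can be checked with a few lines of Lua. The bytes are written as decimal escapes so the result does not depend on the encoding of the source file, and the helper name utf8_len is hypothetical. It assumes its input is valid UTF-8 and needs Lua 5.1 or later for string.match and the # operator.

```lua
-- "©" is U+00A9, which UTF-8 encodes as the two bytes 0xC2 0xA9.
local copy = "\194\169"
local s = "a" .. copy .. copy .. "b"        -- "a©©b"

-- The length operator counts bytes, not characters.
print(#s)                                   --> 6

-- A plain search for a multibyte literal works: it is just a byte sequence.
print(string.find(s, copy, 1, true))        --> 2 3

-- "©*" quantifies only the final byte (0xA9), so this matches one "©",
-- not the run of two.
print(#string.match(s, copy .. "*"))        --> 2

-- Hypothetical helper: count characters by counting every byte that is
-- not a UTF-8 continuation byte (0x80-0xBF). Assumes valid UTF-8 input.
local function utf8_len(str)
  local _, count = string.gsub(str, "[^\128-\191]", "")
  return count
end

print(utf8_len(s))                          --> 4
```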

So basically, UTF-8 renders most Lua core functionality useless as soon as one ventures beyond US-ASCII, broadly speaking?

If you're writing a web server, then your best bet is to emit UTF8,

I can do that.

and avoid doing any string slicing;

I need to do that. A major feature of the app is search.

if you write your Lua scripts in UTF8, then you can
trivially include UTF8 sequences in constant strings:

 local s = "fóö"

Since HTTP can be driven entirely with US-ASCII, this probably won't
cause you any problems.
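
[Editorial sketch, not part of the original message.] On the search and slicing concern raised earlier: UTF-8 is self-synchronizing, so a valid UTF-8 needle can only match at a character boundary inside a valid UTF-8 haystack. Under that assumption, the byte offsets returned by a plain string.find are safe to pass to string.sub. The variable names below are illustrative only.

```lua
-- "fóö bar", with ó = 0xC3 0xB3 and ö = 0xC3 0xB6 written as byte escapes.
local haystack = "f\195\179\195\182 bar"
local needle = "\195\182"                   -- "ö"

-- Plain search (4th argument true) treats the needle as raw bytes,
-- with no pattern magic. The offsets returned are byte positions.
local first, last = string.find(haystack, needle, 1, true)
print(first, last)                          --> 4 5

-- Slicing at the offsets returned by find never splits a character,
-- provided both strings are valid UTF-8.
local before = string.sub(haystack, 1, first - 1)   -- "fó"
local after = string.sub(haystack, last + 1)        -- " bar"
print(#before, #after)                      --> 3 4
```

Slicing at an arbitrary byte index, by contrast, can still land in the middle of a character; only offsets derived from a match are safe in this sense.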

Thanks for the explanations :)


PA, Onnay Equitursay