On Friday 18 February 2005 14:35, PA wrote:
> On Feb 18, 2005, at 12:36, Glenn Maynard wrote:
> > A tip: using a real name on technical lists will tend to get you
> > a better response.
>
> Right... this is email... there is no such a thing as a "read name" :)

I beg to differ; everyone has a real name, and it's usually considered good 
etiquette to use it on a technical list (as opposed to a social list)... 
not an important point, however.

[...]
> Ok. So Lua's encoding reflects the OS encoding?

Basically, Lua doesn't know about encodings. Lua strings are sequences of 
bytes, and the string library assumes that one character is one byte. 
Comparison goes through the C library's strcoll(), which in the default "C" 
locale amounts to comparing byte values.

This means that you can put any kind of data in a string --- but it's your 
responsibility to manipulate it correctly and do any conversion.
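
To make the byte-oriented view concrete, here is a small sketch. It assumes 
Lua 5.1 or later (for the # operator on strings) and uses decimal byte 
escapes so that it does not depend on how the script file itself is encoded:

```lua
-- "café" in UTF8: U+00E9 is the two bytes C3 A9 (195 169).
local s = "caf\195\169"

local len = #s                       -- counts bytes, not characters
local b4, b5 = string.byte(s, 4, 5)  -- the two bytes that make up "é"

-- Arbitrary binary data, including zero bytes, is fine too.
local bin = "a\0b"

-- In the "C" locale, comparison is by byte value, so "é" (lead byte 195)
-- sorts after "z" (byte 122).
os.setlocale("C")
local z_first = "z" < s:sub(4)
```

Here len is 5 even though a reader sees four characters.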

For example, if you're storing UTF8 in a Lua string (which is the recommended 
way of doing Unicode in Lua), then you can't assume that you can read 
character n by looking at byte n. *However*, string substitutions and pattern 
matching will still work in a limited way. The Lua pattern ".*fnord.*" 
will still match any string containing 'fnord', regardless of whether there 
are multibyte characters in the string; likewise, the pattern ".*©.*" will 
work; but "©*" won't do what you want, because the * will bind only to the 
last byte of the multibyte character. The comparison operators will still 
work on single-byte characters but will sort multibyte characters oddly. And 
so on.
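
A rough illustration of the above, assuming Lua 5.1 or later (string.match 
does not exist in 5.0). The name utf8_len is a made-up helper for this 
sketch, not part of the standard library, and it assumes well-formed UTF8:

```lua
local copy = "\194\169"                       -- U+00A9 (©) as UTF8
local s = "x" .. copy .. copy .. copy .. "y"

-- A multibyte character used as a literal inside a pattern matches fine.
local found = string.find(s, ".*" .. copy .. ".*") ~= nil

-- The "*" applies only to the final byte (\169), so this matches a single
-- copyright sign rather than the run of three.
local run = string.match(s, copy .. "*")

-- Hypothetical helper: count characters by counting every byte that is
-- not a UTF8 continuation byte (0x80 to 0xBF, i.e. 128 to 191).
local function utf8_len(str)
  local _, n = string.gsub(str, "[^\128-\191]", "")
  return n
end
```

With this, #s is 8 (bytes) while utf8_len(s) is 5 (characters).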

If you use fixed-width encodings such as UCS2 or UCS4 then of course the 
pattern matching functions become useless to you, since they work on single 
bytes rather than on 2- or 4-byte units.
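
One way to see the problem, as a sketch only: take a single character, 
U+4E61, stored as big-endian UCS2. Its second byte happens to be the ASCII 
code for "a", so a byte-wise search reports a hit that isn't there:

```lua
-- U+4E61 as big-endian UCS2: bytes 0x4E 0x61 (decimal 78, 97).
local ucs2 = "\078\097"

-- A plain (non-pattern) search for the letter "a" lands on the second
-- byte of that character, although the text contains no "a" at all.
local pos = string.find(ucs2, "a", 1, true)
```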

Anything from the ISO8859 family is trivial, of course.

If you're writing a web server, then your best bet is to emit UTF8, and avoid 
doing any string slicing; if you write your Lua scripts in UTF8, then you can 
trivially include UTF8 sequences in constant strings:

 local s = "fóö"

Since HTTP can be driven entirely with US-ASCII, this probably won't 
cause you any problems.
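
As a minimal sketch of what that might look like (build_response is a name 
invented here, not taken from any particular server library, and Lua 5.1 or 
later is assumed): the headers are plain ASCII, the body is passed through 
untouched, and the byte length is exactly what Content-Length wants.

```lua
-- Hypothetical helper: wrap a UTF8 body in a bare-bones HTTP response.
local function build_response(body)
  return table.concat({
    "HTTP/1.0 200 OK",
    "Content-Type: text/html; charset=utf-8",
    "Content-Length: " .. #body,   -- # counts bytes, as HTTP requires
    "",
    body,
  }, "\r\n")
end

-- "fóö" spelled with byte escapes: ó is C3 B3, ö is C3 B6.
local s = "f\195\179\195\182"
local resp = build_response(s)
```

No slicing or re-encoding of the body is involved, which is the point.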

-- 
+- David Given --McQ-+ "There is // One art // No more // No less // To
|  dg@cowlark.com    | do // All things // With art // Lessness." --- Piet
| (dg@tao-group.com) | Hein
+- www.cowlark.com --+