On Friday 18 February 2005 18:14, PA wrote:
[...]
> So basically, UTF-8 renders most Lua core functionality useless as soon
> as one venture beyond US-ASCII, broadly speaking?

No. Most of the core functionality will continue to work fine. Anything that 
makes assumptions as to how many bytes a character takes won't work, but 
there's surprisingly little that does that.

For example:

 instring = "Hello, world!"
 instring = string.gsub(instring, "e", "é")
 instring = string.gsub(instring, "l", "ĺ")
 instring = string.gsub(instring, "o", "ø")
 instring = string.gsub(instring, "!", "¡")
 print(instring)

This works fine. string.gsub() doesn't care that the replacement strings are 
multibyte characters --- it's dealing purely with bytes. In fact, the above 
code will work with *all* character encodings that are ASCII-compatible: 
ISO8859-n, UTF8, some of the Asian encodings, etc. I happened to use UTF8. 
The only caveat is that you have to save the script in the same encoding as 
the data it processes, because otherwise the constant strings in the script 
won't match.
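
To see why this works, it can help to look at the raw bytes. This is only a 
sketch: it assumes UTF8 and Lua 5.1 or later, and it spells é as decimal byte 
escapes so that the result doesn't depend on how the script itself is saved. 
The name e_acute is just for illustration.

```lua
local e_acute = "\195\169"                    -- é in UTF-8: bytes 0xC3 0xA9
local out = string.gsub("Hello", "e", e_acute)

-- string.len() counts bytes, not characters
print(string.len("Hello"), string.len(out))   -- 5   6
-- dump every byte of the result
print(string.byte(out, 1, -1))                -- 72  195  169  108  108  111
```

The one-byte "e" has been swapped for two bytes, and nothing else in the 
string has moved.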

Likewise:

 instring = "abcdëfghi"
 instring = string.gsub(instring, "ë", "e")
 print(instring)

...will also work. But:

 instring = "abcdëfghi"
 s = string.find(instring, "ë")
 print(string.sub(instring, s, s))

Oops! This'll emit a broken character, because it prints only the first byte 
of the ë. Use this instead:

 instring = "abcdëfghi"
 s, e = string.find(instring, "ë")
 -- string.sub() takes start and end byte positions, not a length
 print(string.sub(instring, s, e))

See? I'm not making any assumptions about how long any strings are.
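
The same idea can be wrapped up in a small helper. This is a sketch, not 
anything from the standard library: split_around is a hypothetical name, the 
ë is written as UTF8 byte escapes, and the fourth argument to string.find() 
asks for a plain (non-pattern) search.

```lua
-- Hypothetical helper: split haystack around the first occurrence of needle.
-- Returns the text before the match, the match itself, and the text after it.
local function split_around(haystack, needle)
  local s, e = string.find(haystack, needle, 1, true)  -- true = plain search
  if not s then
    return haystack, nil, ""                            -- no match found
  end
  return string.sub(haystack, 1, s - 1),
         string.sub(haystack, s, e),
         string.sub(haystack, e + 1)
end

local e_diaeresis = "\195\171"                 -- ë in UTF-8
local before, match, after = split_around("abcd" .. e_diaeresis .. "fghi", e_diaeresis)
print(before, after)                           -- abcd    fghi
print(string.len(match))                       -- 2
```

Every position comes from string.find(), so the helper never needs to know 
how many bytes the character occupies.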

The only major issue is that [], * and + in Lua patterns work on single 
bytes, so they won't work on multibyte characters (but they'll keep working 
on the single-byte characters surrounding the multibyte characters)... it's a 
shame that Lua's patterns don't let you repeat a group. Most regexp engines 
allow you to group stuff up like this:

 (abcd)*   matches   abcdabcdabcd

This would provide a very easy way to get regexps working with multibyte 
characters:

 (ë)*

The ë would expand into multiple bytes, and the regexp would Just Work. 
Unfortunately, that's not supported.
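
One possible workaround is to do the repetition by hand. The sketch below 
assumes UTF8 data; match_repeated is a hypothetical helper, not a library 
function, and it only handles a literal string repeated from a given position.

```lua
-- Hypothetical helper: emulate (lit)* anchored at byte position init.
-- Returns the position of the last byte of the run, or init - 1 if the
-- literal does not occur there at all (zero repetitions).
local function match_repeated(s, lit, init)
  local pos = init or 1
  local n = string.len(lit)
  while n > 0 and string.sub(s, pos, pos + n - 1) == lit do
    pos = pos + n                          -- step over one whole character
  end
  return pos - 1
end

local e_diaeresis = "\195\171"             -- ë in UTF-8
local s = "ab" .. string.rep(e_diaeresis, 3) .. "cd"
local last = match_repeated(s, e_diaeresis, 3)
print(last)                                -- 8
print(string.sub(s, last + 1))             -- cd
```

It compares whole byte sequences, so it never splits a character in the 
middle.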

The other thing to be aware of is that string.upper() and string.lower() won't 
work on multibyte characters, but nobody expects them to work on anything 
other than Latin scripts anyway.
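
A quick way to see this for yourself, as a sketch: it assumes UTF8 data and 
forces the plain "C" locale, since the result of string.upper() depends on the 
current locale.

```lua
os.setlocale("C")                          -- assumption: plain C locale
local word = "\195\169t\195\169"           -- "été" in UTF-8
local upper = string.upper(word)

-- Only the ASCII "t" is converted; the bytes of each é are left alone.
print(upper == "\195\169T\195\169")        -- true
```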

-- 
+- David Given --McQ-+ "USER'S MANUAL VERSION 1.0:  The information
|  dg@cowlark.com    | presented in this publication has been carefully
| (dg@tao-group.com) | for reliability." --- anonymous computer hardware
+- www.cowlark.com --+ manual