Roberto Ierusalimschy wrote:
>
> But I guess the easiest way to use Unicode in Lua is with a multibyte
> representation (e.g. UTF-8). Then, you mainly (only?) need a new string
> library; everything else should work without modifications.
> 

Yes, UTF-8 is most definitely the most straightforward way 
to support "Unicode" in Lua. UTF-8 has some very interesting qualities,
such as the lack of embedded null bytes (apart from the NUL character 
itself) and backward compatibility with 7-bit ASCII (but not with 8-bit 
character sets). You can use the same string functions you use on 
regular 8-bit strings to work with UTF-8 strings. The main difficulty 
with UTF-8 is that one character may be 1, 2, 3 or 4 bytes long.
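
To make the variable-length point concrete, here is a minimal sketch of how the first byte of a sequence tells you how many bytes the character occupies. The helper name utf8_seqlen is hypothetical, and the code uses the modern string.byte name rather than strbyte. It does no validation beyond the top bits, so overlong lead bytes such as 0xC0 are not rejected.

```lua
-- Hypothetical helper: number of bytes in the UTF-8 sequence that starts
-- with lead byte `b` (0..255), or nil if `b` cannot start a sequence.
-- Only the top bits are inspected; this is not a full validator.
local function utf8_seqlen(b)
  if b < 0x80 then return 1        -- 0xxxxxxx: plain 7-bit ASCII
  elseif b < 0xC0 then return nil  -- 10xxxxxx: continuation byte
  elseif b < 0xE0 then return 2    -- 110xxxxx
  elseif b < 0xF0 then return 3    -- 1110xxxx
  elseif b < 0xF8 then return 4    -- 11110xxx
  end
  return nil                       -- 0xF8..0xFF never occur in UTF-8
end

local s = "caf\195\169"                  -- "café"; é is the bytes C3 A9
print(#s)                                --> 5 (bytes, not characters)
print(utf8_seqlen(string.byte(s, 1)))    --> 1
print(utf8_seqlen(string.byte(s, 4)))    --> 2
```

Because every byte of a multibyte sequence is 0x80 or above, an ASCII byte in the string always stands for itself, which is what makes byte-oriented functions safe on UTF-8 data.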

IMO, the following Lua string functions should already be 
UTF-8 compatible (if they have been implemented cleanly):

* strfind (s, pattern [, init [, plain]])
* strlower (s)
* strupper (s)
* strrep (s, n)
* format (formatstring, e1, e2, ...)
* gsub (s, pat, repl [, n])
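
A small sketch of what "already compatible" looks like in practice for the functions listed above, written with the modern string.* names (string.find for strfind, and so on) and assuming the default "C" locale. Positions are still byte offsets, and case mapping in that locale only touches ASCII letters, so these calls are safe on UTF-8 data without being character-aware.

```lua
local s = "caf\195\169 au lait"   -- "café au lait"; é is the bytes C3 A9

-- Plain search for a multibyte character; the results are byte offsets.
print(string.find(s, "\195\169", 1, true))   --> 4  5

-- Repetition simply concatenates the bytes.
local r = string.rep("\195\169", 3)
print(#r)                                    --> 6

-- Substitution of a literal multibyte sequence.
local t = string.gsub(s, "\195\169", "e")
print(t)                                     --> cafe au lait

-- In the C locale only ASCII letters are case-mapped, so the two bytes
-- of é pass through unchanged and the string stays valid UTF-8.
local u = string.upper(s)
print(u == "CAF\195\169 AU LAIT")            --> true
```

Pattern items that match a single character, such as "." or a character set, still operate on single bytes, so they can split a multibyte sequence.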

The following would probably need to be altered:

* strbyte (s [, i]): When in a UTF-8 locale, strbyte should 
not return the i-th byte, but the i-th character in s, since 
in UTF-8 one character may span several bytes.

* strchar (i1, i2, ...): In a UTF-8 locale, this should translate
i1, i2, etc., if they have a value above 127, to the corresponding 
multibyte encoding.

* strlen (s): In UTF-8 this should count the number of characters, 
not the number of bytes.

* strsub (s, i [, j]): Again, i and j should be expressed 
as character counts, not as byte indexes.
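
As a rough illustration of the strlen and strsub changes, the sketch below counts and slices by character instead of by byte. The names is_cont, utf8_len and utf8_sub are hypothetical, the input is assumed to be well-formed UTF-8, and only positive indexes are handled. It relies on continuation bytes always having the form 10xxxxxx, so every other byte starts a new character.

```lua
-- Hypothetical helpers; they assume `s` is well-formed UTF-8.

-- True if byte value `b` is a continuation byte (10xxxxxx).
local function is_cont(b)
  return b >= 0x80 and b < 0xC0
end

-- Count characters by counting every byte that is not a continuation byte.
local function utf8_len(s)
  local n = 0
  for i = 1, #s do
    if not is_cont(string.byte(s, i)) then n = n + 1 end
  end
  return n
end

-- Substring by character positions i..j (1-based, inclusive, positive only).
local function utf8_sub(s, i, j)
  j = j or utf8_len(s)
  local first, last   -- byte offsets of the slice
  local n = 0         -- index of the character currently being scanned
  for pos = 1, #s do
    if not is_cont(string.byte(s, pos)) then
      n = n + 1
      if n == i then first = pos end
      if n == j + 1 then last = pos - 1; break end
    end
  end
  if not first then return "" end
  return string.sub(s, first, last or #s)
end

local s = "na\195\175ve"            -- "naïve"; ï is the bytes C3 AF
print(#s, utf8_len(s))              --> 6  5
print(utf8_sub(s, 2, 4))            --> aïv
```

Both functions scan from the start of the string, so character indexing costs time proportional to the string length rather than being constant as with byte indexes.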

However, Lua is so flexible that I think it would be possible 
to implement these modifications in Lua itself. It's been quite a 
while since I worked with UTF-8, but if there is more interest, 
I might be willing to cooperate to get this integrated into Lua.
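
To give an idea of how the strchar and strbyte changes might look when written in Lua itself, here is a minimal sketch using plain arithmetic. The names utf8_encode, utf8_char and utf8_codepoint are hypothetical, and the code assumes valid code points and well-formed input rather than checking for them.

```lua
local floor = math.floor

-- Encode one code point (0..0x10FFFF) as a UTF-8 byte string.
local function utf8_encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + floor(cp / 0x1000),
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + floor(cp / 0x40000),
                       0x80 + floor(cp / 0x1000) % 0x40,
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- strchar-like: build a string from a list of code points.
local function utf8_char(...)
  local parts = {}
  for k = 1, select("#", ...) do
    parts[k] = utf8_encode((select(k, ...)))
  end
  return table.concat(parts)
end

-- strbyte-like: code point of the i-th character of `s`, or nil.
local function utf8_codepoint(s, i)
  i = i or 1
  local n = 0
  for pos = 1, #s do
    local b = string.byte(s, pos)
    if b < 0x80 or b >= 0xC0 then        -- a new character starts here
      n = n + 1
      if n == i then
        if b < 0x80 then return b end
        -- Strip the length marker bits from the lead byte.
        local cp
        if b < 0xE0 then cp = b - 0xC0
        elseif b < 0xF0 then cp = b - 0xE0
        else cp = b - 0xF0 end
        -- Append six payload bits from each continuation byte.
        local k = pos + 1
        local c = string.byte(s, k)
        while c and c >= 0x80 and c < 0xC0 do
          cp = cp * 0x40 + (c - 0x80)
          k = k + 1
          c = string.byte(s, k)
        end
        return cp
      end
    end
  end
  return nil
end

local s = utf8_char(0x41, 0xE9, 0x20AC)   -- "A", "é", "€"
print(#s)                                 --> 6
print(utf8_codepoint(s, 3))               --> 8364
```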


-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
bjorn.demeyer@pandora.be