Hi

On Wednesday 16 February 2005 18:08, PA wrote:
> So... to have Lua consistently behave in en_US UTF-8, would the
> following setting "just work"?!?
uh!
don't!

> os.setlocale( "en_US" )
better set C or POSIX

> os.setlocale( "UTF-8", "collate" )
this kind of works, but slows down strcoll by an order of magnitude.
The reason is that every string, even pure ASCII, has to pass through
the complete UCA (Unicode collation algorithm) just in case it
might be compared to some Chinese characters.
In many applications, like doing a binary search,
you would want to call strxfrm on your search string once
and use the transformed string many times.
With strcoll, however, strxfrm is called internally for every
comparison -- sloooow.

Even when using strxfrm, there is no way to opt for a simple,
reduced collation -- you will always get the full three-and-a-half
level comparison, with collation code points for every character
including Etruscan.
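
For illustration, a minimal Lua sketch of the xfrm-once pattern.
The locale.xfrm below is hypothetical (it stands in for C's strxfrm,
along the lines of the "locale" package suggested under TODO:
collation below), and string comparison is assumed to be plain
bytewise:

  -- build a sorted list of transformed keys, transforming each once
  local function make_index(strings)
    local keys = {}
    for i, s in ipairs(strings) do
      keys[i] = locale.xfrm(s)     -- hypothetical strxfrm binding
    end
    table.sort(keys)               -- bytewise < assumed from here on
    return keys
  end

  -- bytewise binary search: the needle is transformed exactly once,
  -- no strcoll (and hence no hidden strxfrm) inside the loop
  local function find(keys, needle)
    local probe = locale.xfrm(needle)
    local lo, hi = 1, table.getn(keys)
    while lo <= hi do
      local mid = math.floor((lo + hi) / 2)
      if keys[mid] < probe then lo = mid + 1
      elseif probe < keys[mid] then hi = mid - 1
      else return mid end
    end
  end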

That's why we're using a much simpler approach in malete,
and, as announced, I'm going to do a Lua binding.

> os.setlocale( "UTF-8", "ctype" )
This one's not going to work at all.
It will not affect any of the calls used in lstrlib.
On the C side you would need to use the wchar_t
interface, which IMHO is a complete mess.

That's why I, for one, did not write a generic "wstring" lib
-- I don't believe in the "standard" setlocale anyway,
as it has too many, hum, "interesting" side effects.

So check this out:
http://malete.org/tar/slnutf8.0.8.tar.gz (15K)

It includes slnutf8.c, slnudata.c and a little test script
(best run in an xterm -u8; be careful about the luit setting).
It could be used as a replacement for string,
but that is probably not advisable, as it might heavily
confuse some libs.
It adds about 21K when statically linked to an intel/linux/
dietlibc binary (without SLNUTF8_USESTRING defined).

What is done:
--  len(str [,mode=1])
--  sub(str, start [,end=-1 [,mode=1]])
--  mode is: 0=bytewise (as in string), 1=by code point, 2=by grapheme cluster
--  lower(str)
--  upper(str)
--  char(i [,j...])
Counting by grapheme cluster takes care of most of the
Grapheme_Extend properties (which basically means treating
an 'a' followed by a combining circumflex as one character),
ignoring some special cases (Other_Grapheme_Extend) and the
Hangul syllable type (whoever needs those, please go ahead and add them).
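
A quick sketch of the modes, assuming the lib is loaded as "utf8"
(the values follow from the rules above; check against the test
script for the exact behaviour of sub):

  local s = "a\204\130"        -- 'a' plus U+0302 combining circumflex
  print(utf8.len(s, 0))        -- 3 bytes (same as string.len)
  print(utf8.len(s))           -- 2 code points (mode 1 is the default)
  print(utf8.len(s, 2))        -- 1 grapheme cluster
  print(utf8.sub(s, 1, 1, 2))  -- first cluster: the whole 3-byte string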

The UTF-8 sequence detection is very strict in that it does not
allow invalid sequences, such as encodings of ASCII characters
using more than one byte, thus addressing the security concerns
mentioned in http://ietf.org/rfc/rfc3629.txt .
The one exception is that UTF-16 surrogates are passed through.
These should not cause any harm and are sometimes found
in variants like CESU-8.
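
To make the overlong case concrete:

  local ok       = "/"         -- 0x2F, the only valid encoding of '/'
  local overlong = "\192\175"  -- 0xC0 0xAF, forbidden 2-byte form of '/'

A sloppy decoder that just masks the continuation bits would read
both as '/', letting a '/' sneak past byte-oriented filters; the
strict sequence detection refuses the second form. (How exactly the
rejection is reported is best checked against the source.)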

Upper and lower ignore special casing; add it as needed.
Special casing also includes a few locale-dependent variants.
However, should someone want to take care of these, I'd object to
checking the libc locale and would rather favour utf8.specialcasing("tr").
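
The classic case in point is Turkish, where i/I do not pair up the
usual way; a sketch of how the proposed (not yet implemented) switch
would play out:

  -- default, locale-independent mappings:
  --   utf8.upper("i") -> "I",   utf8.lower("I") -> "i"
  utf8.specialcasing("tr")     -- hypothetical, as proposed above
  -- Turkish mappings would then be:
  --   utf8.upper("i") -> U+0130 (capital I with dot above)
  --   utf8.lower("I") -> U+0131 (dotless small i)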

TODO: byte
This is very easily done; I'm just considering whether to keep the name
or instead use "code". A "code" function could also include a switch
to decode CESU-8 style embedded UTF-16 surrogates.
Suggestions?
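
For comparison, a hypothetical "code" mirroring string.byte (the name
and signature are exactly what is up for discussion here):

  print(string.byte("\226\130\172"))  -- 226, first byte of the euro sign
  print(utf8.code("\226\130\172"))    -- hypothetical: 8364, i.e. U+20AC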

TODO: matching
This actually appears to be not too difficult. Once singlematch learns
that it might affect multiple bytes, the rest should just pass as is.
The more interesting question is: since the character categories are based on
http://unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
(which *are* locale independent), it would be nice to allow these
categories as character class primitives. Hmmm ... Comments?
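
Purely as a strawman (the syntax would need some thought, since
e.g. %L is already taken as "not lowercase"):

  -- hypothetical: a class standing for General_Category Lu,
  -- "uppercase letter", where today's %u covers ASCII only
  print(utf8.find("strasse MUENCHEN", "%Lu+"))  -- would find "MUENCHEN"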

TODO: port Tcl's tools/uniParse.tcl to lua
This is the script used by Tcl to create the slnudata.c
from the Unicode character database. Volunteers?

TODO: collation
Here we are again.
Frankly speaking, I heavily object to the strcoll in lvm.c, and in fact
compile it with strcoll defined to strcmp. There are occasions
where you want simple, fast, reliable bytewise comparison,
especially where your string does not contain character data,
such as when it is the result of strxfrm (sic!). This would be broken by
setting the locale -- ouch. In other words, using locale-sensitive
calls at the lowest level *prevents* careful use of locale settings,
like maintaining an index of strxfrmed data.

Whether in string or utf8, it should be an explicit "coll" function,
together with a matching "xfrm" function -- and my suggestion is
to not add this to lstrlib, but instead create a "locale" package
to contain such stuff.
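
What that package might look like in use (names hypothetical, coll
assumed to follow strcoll's return convention):

  local locale = require "locale"  -- hypothetical package
  -- explicit, visible locale-aware sorting ...
  table.sort(names, function(a, b) return locale.coll(a, b) < 0 end)
  -- ... and precomputed keys for index maintenance
  local key = locale.xfrm(name)
  -- while the built-in a < b stays plain, fast and bytewise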

Since the special casings (although mostly locale independent)
involve similar sequence detection techniques, I'm going to wrap
these up together with the collation package.

TODO: encodings
These would be four different packages, for single-byte,
double-byte, multi-byte and escape-driven encodings.
At least the single-byte one is small and simple and shall be provided soon.
As far as I'm concerned, I would only statically link the single-byte
lib to my stuff; the rest you need, well, when you need it.


saludos
Klaus