- Subject: UTF-8 [was Re: LuaSocket http ftp smpt...]
- From: Klaus Ripke <paul-lua@...>
- Date: Thu, 3 Feb 2005 12:17:11 +0100
On Thursday 03 February 2005 10:06, PA wrote:
> > The most common charset is ISO-8859-1, which is enough for us Latin
> > people.
well here in Berlin, Western Latin is not enough, since Eastern Latin
is only 80km away :)
> I need UTF-8 at a bare minimum.
As I could not find any UTF-8 lib
(and is http://lua-users.org/wiki/LuaUnicode up to date?),
I have all but started porting Tcl's implementation of both UTF-8
(which basically is the 2K character classification and casing table
from Plan 9's rune system) and encoding
(which imho also is done in a quite reasonable fashion).
I would do this in a completely locale-independent way;
as for the casing there are only very few locale issues
(actually the I-dot thing in Turkish and Azeri and a similar
case in Lithuanian. Since here we have kind of a
German-Turkish locale, we would get it wrong anyway).
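
To make the I-dot issue concrete, here is a minimal sketch in Python (used only for illustration; it is not the Lua library discussed here). Python's built-in casing is locale-independent, so it shows what such a library would do by default. The helpers `upper_tr` and `lower_tr` are hypothetical names showing the extra mapping a Turkish or Azeri locale would need.

```python
# Locale-independent casing: plain str.upper()/str.lower() follow the
# default Unicode mappings and know nothing about Turkish.
# Turkish pairs: i <-> U+0130 (dotted capital I), U+0131 (dotless i) <-> I.

# Hypothetical Turkish-specific pre-mappings, applied before default casing.
TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})

def upper_tr(s):
    """Uppercase with the Turkish/Azeri dotted and dotless I rules."""
    return s.translate(TR_UPPER).upper()

def lower_tr(s):
    """Lowercase with the Turkish/Azeri dotted and dotless I rules."""
    return s.translate(TR_LOWER).lower()

default_result = "istanbul".upper()   # default mapping: plain ASCII I
turkish_result = upper_tr("istanbul") # Turkish mapping: dotted capital I
```

With the default mapping the first letter becomes a plain `I`; with the Turkish mapping it becomes U+0130. Everything else is identical, which is why a locale-independent table gets nearly all text right.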
So, casing and character class detection for matching is mostly
easy (well, the latter has issues with combining diacritical marks ...).
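
The combining-marks problem can be sketched as follows, again in Python purely for illustration. A per-character class test rejects a letter followed by a combining accent, although the pair renders as one letter. The function `is_word` is a hypothetical name, and treating any mark after a letter as part of that letter is a simplifying assumption, not a full grapheme-cluster rule.

```python
import unicodedata

def is_word(s):
    """True if s consists only of letters, each optionally followed
    by combining marks (general categories Mn, Mc, Me)."""
    seen_base = False
    for ch in s:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            seen_base = True          # a base letter
        elif cat.startswith("M") and seen_base:
            continue                  # mark attached to a preceding letter
        else:
            return False              # digit, space, or a mark with no base
    return seen_base

precomposed = "caf\u00e9"    # e-acute as a single code point
decomposed = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT

naive = decomposed.isalpha() # per-character test: the mark is not a letter
aware = is_word(decomposed)  # mark-aware test accepts the sequence
```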
Normalization would be out of the game as this is a heavy beast.
For collation I'd go for a port of our freely defined collations
http://malete.org/Doc/CharSet (which does not yet include
multilevel collation but is otherwise quite small and efficient.
Basically I'm doing that in order to put Lua in our DB anyways).
Er, maybe it was not clear: the interface obviously is such that
you can say "string = utf8" and voilà.
Encoding would be independent from UTF-8:
while recoding from A to B would go via UTF-8,
there is no need to know anything about the intermediate
stage (beyond recognizing the byte sequences).
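
A rough sketch of that pivot idea, in Python for illustration only: each single-byte charset gets one table to UTF-8 and one back, and the recoder only needs the lead byte to find sequence boundaries in the pivot. The names `utf8_len`, `build_tables` and `recode` are hypothetical, and filling the tables from Python's own codecs is just a shortcut for this example. Only single-byte charsets are handled, and unmappable characters become `?`.

```python
def utf8_len(lead):
    """Length of a UTF-8 sequence given its lead byte, 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, overlong lead (C0, C1), or beyond F4

def build_tables(codec):
    """Tables for a single-byte charset: byte -> UTF-8 sequence and back."""
    to_utf8, from_utf8 = {}, {}
    for b in range(256):
        try:
            seq = bytes([b]).decode(codec).encode("utf-8")
        except UnicodeDecodeError:
            continue  # byte is undefined in this charset
        to_utf8[b] = seq
        from_utf8[seq] = b
    return to_utf8, from_utf8

def recode(data, src, dst, replacement=ord("?")):
    """Recode bytes from charset src to charset dst via a UTF-8 pivot."""
    to_utf8, _ = build_tables(src)
    _, from_utf8 = build_tables(dst)
    # Stage 1: source bytes to UTF-8 (KeyError on a byte undefined in src).
    pivot = b"".join(to_utf8[b] for b in data)
    # Stage 2: split the pivot into sequences by lead byte and look each
    # one up; the code point value itself is never computed.
    out = bytearray()
    i = 0
    while i < len(pivot):
        n = utf8_len(pivot[i])
        if n == 0:
            raise ValueError("invalid UTF-8 lead byte in pivot")
        out.append(from_utf8.get(pivot[i:i + n], replacement))
        i += n
    return bytes(out)

# Eastern Latin text from ISO-8859-2 to Windows-1250.
latin2 = "\u0141\u00f3d\u017a".encode("iso-8859-2")
cp1250 = recode(latin2, "iso-8859-2", "cp1250")
```

With N charsets this needs 2N tables instead of one per pair, which is the main attraction of going through a pivot.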
Suggestions? Pointers? Stop it, stupid? Go on?