- Subject: Re: lua for unicode
- From: "Peter Loveday" <peter@...>
- Date: Sat, 30 Nov 2002 11:45:05 -0500
> Exactly. Unicode doesn't even support the full range of (Chinese)
> Kanji characters used in Japanese. Unicode was in the beginning an
> evil Microsoft invention, in which they tried to implement all languages
> in just 65000 characters.
It may be true that Unicode doesn't currently support all characters, but
it certainly supports a lot more than 65,000. You may be thinking of the
UCS-2 encoding, as used by Windows, which is limited to 65,536 16-bit
codes; Unicode itself defines a much larger code space than that.
According to unicode.org (the official site for Unicode), the current
Unicode spec (3.2) defines codes for 95,221 characters.
Unicode can be encoded in several formats. UCS-2 was the old way some
systems/programs supported it, but it cannot represent every Unicode code.
The modern formats are usually UTF-8, UTF-16, or UTF-32. UTF-8 uses
multi-byte sequences and UTF-16 uses surrogate pairs to represent the full
code range, while UTF-32 stores each code directly in 32 bits.
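
To make the difference between the formats concrete, here is a rough sketch
(in Python, purely for illustration; utf16_surrogates is a made-up helper
name, not part of any library) of how one code above the 16-bit range,
U+20000, comes out in each encoding:

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not above the 16-bit range")
    v = cp - 0x10000             # remaining 20-bit value
    high = 0xD800 + (v >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low

cp = 0x20000                     # a CJK ideograph that UCS-2 cannot hold
ch = chr(cp)

utf8_bytes = ch.encode("utf-8")       # 4 bytes: f0 a0 80 80
utf16_bytes = ch.encode("utf-16-be")  # 2 x 16-bit units: d840 dc00
utf32_bytes = ch.encode("utf-32-be")  # 1 x 32-bit unit: 00020000

print(utf8_bytes.hex(), utf16_bytes.hex(), utf32_bytes.hex())
print([hex(u) for u in utf16_surrogates(cp)])
```

The hand-computed surrogate pair matches what the built-in UTF-16 encoder
produces, which is the part UCS-2 has no equivalent for.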
It may well be that not all characters have been assigned codes in Unicode,
but it still has a huge number of available codes; I believe the current
spec allows for codes in the range U+0000 through U+10FFFF, or 1,114,112
possible code points.
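
For anyone who wants to check the arithmetic, a quick sketch (the constant
names are just mine): the range is 17 blocks of 65,536 codes each.

```python
MAX_CODE_POINT = 0x10FFFF   # highest code in the range
PLANE_SIZE = 0x10000        # 65,536 codes, i.e. what 16 bits can address

total = MAX_CODE_POINT + 1          # codes are numbered from 0
planes = total // PLANE_SIZE        # how many 16-bit-sized blocks that is

print(total, planes)
```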
Using UTF-8 in Lua is attractive because Lua strings are already 8-bit
clean. The string lib would certainly need to be updated to deal with a
character being between 1 and 4 bytes long, as the UTF-8 spec defines, and
there are some collation issues that are very complex. IBM have an
excellent, freely usable Unicode library available at
http://oss.software.ibm.com/icu/ that deals with all these issues, but it
is much too large to utilise in a standard Lua distribution. Still, adding
basic abilities to handle UTF-8 without proper collation would not be too
hard, perhaps with an add-on module that supports full Unicode handling.
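
As a rough idea of what "basic abilities" might mean, here is a minimal
sketch of counting characters rather than bytes in a UTF-8 string. It is
written in Python only to show the logic; utf8_seq_len and utf8_len are
hypothetical names, not existing Lua functions, and the sketch assumes the
input is otherwise well formed (it does not check continuation bytes).

```python
def utf8_seq_len(lead):
    """Byte length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1                    # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                    # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3                    # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                    # 11110xxx, up to U+10FFFF
    raise ValueError("invalid UTF-8 lead byte: 0x%02x" % lead)

def utf8_len(data):
    """Count characters in a UTF-8 byte string by hopping lead to lead."""
    count = 0
    i = 0
    while i < len(data):
        i += utf8_seq_len(data[i])
        count += 1
    if i != len(data):
        raise ValueError("truncated UTF-8 sequence at end of input")
    return count

s = "h\u00e9llo \U00020000".encode("utf-8")
print(len(s), utf8_len(s))   # byte count versus character count
```

The same lead-byte test would translate fairly directly into C for the
string lib; collation, case mapping and the like are what would still need
something on the scale of ICU.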
Love, Light and Peace,
- Peter Loveday
Director of Development, eyeon Software