lua-l archive


On Thu, Apr 26, 2007 at 11:00:46AM +0200, David Kastrup wrote:
> And so on.  slnunicode does not actually do much in the area of
> verification.
The statement is:
According to the RFC, we support up to 4-byte
(21-bit) sequences, encoding the UTF-16-reachable range 0-0x10FFFF.
Any byte that is not part of a valid 2-4 byte sequence in that range decodes to itself.
The ill-formed (non-shortest-form) sequence "C0 80" decodes as the two code points C0 and 80,
not as code point 0; see the security considerations in the RFC.
However, UTF-16 surrogates (D800-DFFF) are accepted.
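For illustration, the decode rule described above can be sketched in Python (this models the documented behavior, it is not the slnunicode C source; the function name and structure are my own):

```python
def lenient_decode(data: bytes) -> list[int]:
    """Decode UTF-8 leniently: valid 2-4 byte sequences (surrogates
    included) become their code point; every other byte, including
    each byte of an overlong form like C0 80, decodes to itself."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(b); i += 1; continue
        # classify the lead byte: (sequence length, initial bits, shortest-form minimum)
        if   0xC0 <= b <= 0xDF: n, cp, lo = 2, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF: n, cp, lo = 3, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4: n, cp, lo = 4, b & 0x07, 0x10000
        else:                             # stray continuation byte or invalid lead
            out.append(b); i += 1; continue
        tail = data[i + 1:i + n]
        if len(tail) == n - 1 and all(0x80 <= c <= 0xBF for c in tail):
            for c in tail:
                cp = (cp << 6) | (c & 0x3F)
            if lo <= cp <= 0x10FFFF:      # shortest form and in range; D800-DFFF pass
                out.append(cp); i += n; continue
        out.append(b); i += 1             # ill-formed: the byte decodes to itself
    return out
```

So `lenient_decode(b"\xC0\x80")` gives `[0xC0, 0x80]` rather than `[0]`, and the surrogate `ED A0 80` is accepted as D800.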

Decode-then-encode always yields valid UTF-8.
It does not drop invalid Unicode characters such as the surrogates,
but that can be achieved with a match.
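That match step might look like this (a hypothetical post-filter sketched in Python, not slnunicode API): after decoding, delete any code point in the surrogate range before re-encoding.

```python
import re

# Hypothetical post-filter: remove UTF-16 surrogate code points
# (U+D800-U+DFFF), leaving only valid Unicode scalar values.
def strip_surrogates(s: str) -> str:
    return re.sub(r"[\ud800-\udfff]", "", s)
```

For example, `strip_surrogates("a\ud800b")` returns `"ab"`.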
Testing for proper use of combining diacritical marks is harder.

> It is not all too clear in my opinion how one could create a small
> footprint Lua that supported byte arrays (if you want to, unibyte
> strings) and multi-byte character strings where the characters
> actually formed atomic string components.
slnunicode supports both modes.
The footprint is mostly the roughly 12K Unicode character table.

> In short: proper utf-8 support comes at a price, and even large
> closely related projects don't arrive at the same solutions.
Well, the UTF-8 encoding itself is not the hard part.
slnunicode still lacks many Unicode features, such as special casing,
canonical (de)composition, and collation.