Klaus Ripke writes:

> On Thu, Apr 26, 2007 at 11:00:46AM +0200, David Kastrup wrote:
>> And so on.  slnunicode does not actually do much in the area of
>> verification.
> The statement is:
> "
> According to we support up to 4-byte
> (21 bit) sequences encoding the UTF-16 reachable 0-0x10FFFF.
> Any byte not part of a 2-4 byte sequence in that range decodes to itself.
> Ill formed (non-shortest) "C0 80" will be decoded as two code points C0 and 80,
> not code point 0; see security considerations in the RFC.
> However, UTF-16 surrogates (D800-DFFF) are accepted.
> "
> Decode-encode always gives valid UTF-8.

But it is not an unambiguous representation of the input.  Personally,
I favor the strategy "any byte not part of a legal minimal 1-4 byte
sequence decodes to 0x1100xx, and values 0x1100xx encode as single
bytes xx again".  Note that the utf-8 coding algorithm easily supports
values in that range, so one can still do string manipulation as
usual.  It is also easy to weed out/flag illegal bytes.  It also means
that one needs two different encoding procedures: one for the utf-8
used internally (always valid, except that it may contain patterns for
0x1100xx; basically a packed array representation) and one for
external utf-8 (where reencoding may produce arbitrary bytes).

The disadvantage is that each illegal byte needs 4 bytes for its
internal representation: that means that decoding garbage might blow
up the byte count by a factor of up to 4.
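
A minimal Python sketch of the escaping strategy described above may make
it more concrete.  It is an illustration only, not code from slnunicode
or Emacs.  The names ESCAPE_BASE, decode, encode_internal and
encode_external are hypothetical, and treating surrogates (D800-DFFF) as
illegal is an assumption about what "legal" means here.

```python
ESCAPE_BASE = 0x110000  # hypothetical name: first value past the Unicode range


def decode(data: bytes) -> list:
    """Decode bytes to code points; each stray byte xx becomes 0x1100xx."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(b)
            i += 1
            continue
        # Sequence length, payload bits of the lead byte, smallest legal value.
        if 0xC2 <= b <= 0xDF:
            length, cp, lo = 2, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            length, cp, lo = 3, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            length, cp, lo = 4, b & 0x07, 0x10000
        else:                             # continuation byte or invalid lead
            length, cp, lo = 0, 0, 0
        tail = data[i + 1:i + length]
        ok = (length > 0 and len(tail) == length - 1
              and all((t & 0xC0) == 0x80 for t in tail))
        if ok:
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            # Reject non-shortest forms, out-of-range values and
            # (assumption) surrogates.
            ok = lo <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
        if ok:
            out.append(cp)
            i += length
        else:
            out.append(ESCAPE_BASE + b)   # escape this one byte, then resync
            i += 1
    return out


def encode_internal(cps) -> bytes:
    """Plain UTF-8 bit packing; escape values take the 4-byte form."""
    out = bytearray()
    for cp in cps:
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes(out)


def encode_external(cps) -> bytes:
    """Escape values 0x1100xx turn back into the single byte xx."""
    out = bytearray()
    for cp in cps:
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0xFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)
```

In this sketch, encode_external(decode(data)) returns data unchanged for
any byte string, while encode_internal spends four bytes on every escaped
byte.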

The advantage is that processing can rely on characteristics of the

>> It is not all too clear in my opinion how one could create a small
>> footprint Lua that supported byte arrays (if you want to, unibyte
>> strings) and multi-byte character strings where the characters
>> actually formed atomic string components.
> slnunicode supports both modes.
> The footprint is mostly about 12K for the unicode character table.

Please note that slnunicode does not really provide strings where the
atomic elements are Unicode characters: string indices and similar
things are always byte-based.

This is more or less what the first iteration of multibyte-support in
Emacs 20 was like.  People hated it.

>> In short: proper utf-8 support comes at a price, and even large
>> closely related projects don't arrive at the same solutions.
> well, the UTF-8 encoding is not the hard part.
> slnunicode is lacking a lot of unicode features like special casing,
> canonical de/composition and collations.

I am more worried about the indexing and atomicity of string
characters.  For the programmer, no model except a packed array of
unicode-characters makes sense.

As soon as you have to continuously worry about byte counts instead of
character counts, the complexity of the programming model explodes.
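
To illustrate the difference between byte counts and character counts,
here is a small Python sketch.  It is not taken from slnunicode or
Emacs; the helper byte_offset is a hypothetical name, and it assumes its
input is valid UTF-8.

```python
text = "na\u00efve caf\u00e9"   # 10 characters, two of them non-ASCII
raw = text.encode("utf-8")      # 12 bytes: each non-ASCII character takes 2


def byte_offset(raw: bytes, char_index: int) -> int:
    """Return the byte offset at which character number char_index starts."""
    seen = 0
    for offset, b in enumerate(raw):
        if (b & 0xC0) != 0x80:          # not a continuation byte
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:              # position just past the last character
        return len(raw)
    raise IndexError(char_index)


# Character indexing is direct; byte indexing can land inside a character.
third_char = text[2]                    # the whole character U+00EF
fragment = raw[2:3]                     # only the lead byte of that character
v_offset = byte_offset(raw, 3)          # 'v' is character 3 but starts at byte 4
```

With byte-based indices, every lookup by character position needs a scan
like the one in byte_offset, and a slice taken at the wrong offset
yields a fragment that is not valid UTF-8 on its own.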

David Kastrup