lua-l archive

(Replying to the digest, so apologies for lack of threading.)

UTF-8 has the nice property that the NUL (zero) octet (usually
byte or char, although some DSPs have 32-bit chars...) never
occurs in a valid sequence, except as the encoding of the NUL
character '\0' itself, where it can happily serve as a terminator.

There is an alternative way of encoding zero (U+0000), which
takes two octets and avoids using a zero octet: 0xc0 0x80. Sadly,
Unicode prohibits such alternate (overlong) encodings, so this is
an invalid sequence.
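
To make the bit-twiddling concrete, here is a rough sketch in Python (my choice of language, purely for illustration). The helper names two_octet_payload and strict_decoder_rejects are made up for this example. It shows that the pair 0xc0 0x80 fits the two-octet pattern 110xxxxx 10xxxxxx with an all-zero payload, and that Python's built-in UTF-8 decoder, which follows the Unicode rules, refuses it.

```python
def two_octet_payload(b0, b1):
    """Return the 11 payload bits of a 110xxxxx 10xxxxxx pair.

    Deliberately does NOT reject overlong forms, to show what they carry.
    """
    if b0 & 0xE0 != 0xC0 or b1 & 0xC0 != 0x80:
        raise ValueError("not a two-octet UTF-8 pattern")
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_decoder_rejects(data):
    """True if Python's built-in UTF-8 decoder refuses the given bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong_nul = bytes([0xC0, 0x80])
print(two_octet_payload(*overlong_nul))      # payload of the overlong pair
print(strict_decoder_rejects(overlong_nul))  # is the overlong form refused?
print(strict_decoder_rejects(b"\x00"))       # is the one-octet NUL refused?
```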

For most code points, the alternate-encoding prohibition is a very
welcome property, as it makes input validation easier.

However, people have noticed this possibility and have made it a
fairly formal informal standard called "Modified UTF-8", and Wikipedia
notes the existence of "Modified UTF-8" implementations in various
places, including Java components, and Tcl internals (remember,
however, that citing Wikipedia is NOT the same as citing a
trustworthy standard).

With Modified UTF-8, you get both an unambiguous encoding of NUL
as two octets and the freedom to append a NUL octet as a C string
terminator, which can ease using legacy C interfaces.
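
As a minimal sketch of the idea (Python again, with hypothetical helper names encode_nul_free, decode_nul_free and c_strlen), the following covers only the NUL handling. Java's flavour of Modified UTF-8 also treats supplementary characters differently, which is ignored here. The decode step relies on the fact that octet 0xc0 never appears in valid standard UTF-8, so every 0xc0 0x80 pair in the input must be an encoded NUL.

```python
def encode_nul_free(text):
    """Encode text as UTF-8, but write each NUL as the pair 0xc0 0x80."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data):
    """Map each 0xc0 0x80 pair back to NUL, then decode strictly."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def c_strlen(buf):
    """Mimic C's strlen(): count octets up to the first zero octet."""
    return buf.index(b"\x00")


payload = encode_nul_free("a\x00b")  # embedded NUL becomes two octets
c_string = payload + b"\x00"         # terminator appended for a C API

length = c_strlen(c_string)          # stops at the terminator only
print(length == len(payload))
print(decode_nul_free(c_string[:length]) == "a\x00b")
```

The point of the exercise is that the only zero octet in c_string is the terminator, so a strlen()-style scan sees the whole payload, embedded NUL included.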

-- sur-b.