lua-users home
lua-l archive

On Jul 8, 2016 5:48 AM, "sur-behoffski" <sur_behoffski@grouse.com.au> wrote:
>
> (Replying to the digest, so apologies for lack of threading.)
>
> UTF-8 has the nice property that the NUL (zero) octet (usually
> byte or char, although some DSPs have 32-bit chars...) never
> occurs in a valid sequence, except as the ASCII/EBCDIC
> character '\0', where it can happily serve as a terminator.
>
> There is an alternative way of encoding zero (U+0000), which
> takes two octets and avoids using a zero octet, but, sadly,
> Unicode prohibits alternate (longer) encodings, so this is an
> invalid sequence: 0xc0 0x80.
>
> For most code points, the alternate-encoding prohibition is a very
> welcome property, as it makes input validation easier.
>
> However, people have noticed this overlap, have made it a fairly
> formal informal standard called "Modified UTF-8", and Wikipedia
> notes the existence of "Modified UTF-8" implementations in various
> places, including Java components, and Tcl internals (remember,
> however, that citing Wikipedia is NOT the same as citing a
> trustworthy standard).
>
> With Modified UTF-8, you get both an unambiguous encoding of an
> embedded NUL as two octets and the freedom to append a NUL octet
> as a C string terminator, which can ease using legacy C interfaces.
>
> -- sur-b.
>

What use is this? Lua strings are already binary safe (able to contain embedded NULs).
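
For concreteness, the NUL handling described in the quoted message could be sketched in C roughly as below. This is only an illustration under stated assumptions: the function name mutf8_encode_nul is hypothetical and not part of Lua or any library, the input is assumed to already be valid UTF-8, and the sketch covers only the NUL substitution (Java's Modified UTF-8 also treats supplementary characters differently, which is not shown here).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from any library.
 * Copies `len` bytes from `src`, writing each embedded NUL octet as the
 * overlong two-octet pair 0xC0 0x80, then appends one real NUL terminator.
 * Returns a malloc'd buffer the caller must free, or NULL on failure. */
char *mutf8_encode_nul(const char *src, size_t len)
{
    /* First pass: count embedded NULs so the output can be sized exactly. */
    size_t nuls = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\0')
            nuls++;
    }

    /* Each NUL grows by one octet; one more octet holds the terminator. */
    char *out = malloc(len + nuls + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, substituting 0xC0 0x80 for every NUL. */
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\0') {
            out[j++] = (char)0xC0;
            out[j++] = (char)0x80;
        } else {
            out[j++] = src[i];
        }
    }
    out[j] = '\0'; /* the only zero octet in the result */
    return out;
}
```

Given the three input bytes 'a', NUL, 'b', the returned buffer holds 0x61 0xC0 0x80 0x62 followed by the terminator, so strlen() on it reports 4 and the result can be passed to an interface that expects a C string. A strict UTF-8 decoder would still reject the 0xC0 0x80 pair, so the receiving side has to know it is dealing with the modified form.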