Re: C strings, NULs (not NULLs)... and "modified UTF-8"

lua-l archive


Subject: Re: C strings, NULs (not NULLs)... and "modified UTF-8"
From: Rena <hyperhacker@...>
Date: Sun, 10 Jul 2016 10:59:31 -0400

On Jul 8, 2016 5:48 AM, "sur-behoffski" <sur_behoffski@grouse.com.au> wrote:
>
> (Replying to the digest, so apologies for lack of threading.)
>
> UTF-8 has the nice property that the NUL (zero) octet (usually
> byte or char, although some DSPs have 32-bit chars...) never
> occurs in a valid sequence, except as the ASCII/EBCDIC
> character '\0', where it can happily serve as a terminator.
>
> There is an alternative way of encoding zero (U+0000), which
> takes two octets and avoids using a zero octet: 0xc0 0x80. Sadly,
> Unicode prohibits such alternate (longer) encodings, so this is
> an invalid sequence.
>
> For most code points, the alternate-encoding prohibition is a very
> welcome property, as it makes input validation easier.
>
> However, people have noticed this overlap, have made it a fairly
> formal informal standard called "Modified UTF-8", and Wikipedia
> notes the existence of "Modified UTF-8" implementations in various
> places, including Java components, and Tcl internals (remember,
> however, that citing Wikipedia is NOT the same as citing a
> trustworthy standard).
>
> With Modified UTF-8, you get both an unambiguous two-octet
> encoding of NUL and the freedom to append a NUL octet as a C string
> terminator, which can ease using legacy C interfaces.
>
> -- sur-b.
>

What use is this? Lua strings are already binary safe (able to contain embedded NULs).
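
For reference, here is a minimal C sketch of the NUL rule described in the quoted message, which only matters once a string is handed to a C API that expects a NUL terminator. The function names mutf8_escape_nul and mutf8_unescape_nul are hypothetical, not taken from Lua, Java, or Tcl, and the sketch covers only the 0xC0 0x80 substitution; it does not validate UTF-8 or handle anything else a particular "Modified UTF-8" implementation may do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src to dst, rewriting every
 * 0x00 byte as the two-byte overlong form 0xC0 0x80, then append one
 * terminating NUL. dst must hold at least 2 * len + 1 bytes.
 * Returns the number of bytes written, excluding the terminator. */
size_t mutf8_escape_nul(const unsigned char *src, size_t len,
                        unsigned char *dst, size_t cap)
{
    size_t o = 0;
    assert(cap >= 2 * len + 1);
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[o++] = 0xC0;
            dst[o++] = 0x80;
        } else {
            dst[o++] = src[i];
        }
    }
    dst[o] = 0x00; /* the only zero octet in the output */
    return o;
}

/* Hypothetical inverse: read a NUL-terminated buffer produced by
 * mutf8_escape_nul and turn each 0xC0 0x80 pair back into 0x00.
 * dst needs at most strlen(src) bytes. Returns the decoded length. */
size_t mutf8_unescape_nul(const unsigned char *src, unsigned char *dst)
{
    size_t o = 0;
    for (size_t i = 0; src[i] != 0x00; i++) {
        if (src[i] == 0xC0 && src[i + 1] == 0x80) {
            dst[o++] = 0x00;
            i++; /* skip the continuation byte */
        } else {
            dst[o++] = src[i];
        }
    }
    return o;
}
```

For the three input bytes 'a', 0x00, 'b', the escaped form is the four bytes 0x61 0xC0 0x80 0x62 followed by the terminator, so strlen() on the result reports 4 instead of stopping after the 'a'.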
