C strings, NULs (not NULLs)... and "modified UTF-8"

lua-l archive


Subject: C strings, NULs (not NULLs)... and "modified UTF-8"
From: sur-behoffski <sur_behoffski@...>
Date: Fri, 8 Jul 2016 19:17:37 +0930

(Replying to the digest, so apologies for lack of threading.)

UTF-8 has the nice property that the NUL (zero) octet (usually
byte or char, although some DSPs have 32-bit chars...) never
occurs inside a multi-octet sequence; it appears only as the
single-octet encoding of the ASCII character '\0', where it can
happily serve as a terminator.
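
As a rough illustration of that property, here is a minimal sketch of a
standard UTF-8 encoder in C, plus a brute-force check over every code point.
The function names (utf8_encode, count_code_points_with_nul) are made up for
this example, and surrogates are not rejected, to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as standard UTF-8.
 * Returns the number of octets written (1..4), or 0 if cp > U+10FFFF.
 * Hypothetical helper; surrogates are deliberately not rejected here. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                 /* 0xxxxxxx: the only form that can be zero */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {            /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Count the code points whose encoding contains a zero octet. */
unsigned long count_code_points_with_nul(void)
{
    unsigned long count = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        unsigned char buf[4];
        size_t n = utf8_encode(cp, buf);
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == 0) {
                count++;
                break;
            }
        }
    }
    return count;
}
```

Every lead octet of a multi-octet sequence has its top two bits set, and every
continuation octet has its top bit set, so only U+0000 itself can produce a
zero octet.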

There is an alternative way of encoding zero (U+0000), which
takes two octets and avoids using a zero octet: 0xC0 0x80. Sadly,
Unicode prohibits alternate (overlong) encodings, so this is an
invalid sequence.

For most code points, the alternate-encoding prohibition is a very
welcome property, as it makes input validation easier.
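
To make the validation point concrete, here is a small sketch of a strict
decoder for two-octet sequences only. The name utf8_decode2 and the -1 error
convention are assumptions for this example, not any particular library's API.

```c
/* Decode a two-octet UTF-8 sequence.
 * Returns the code point, or -1 if the pair is malformed or overlong.
 * Hypothetical helper, limited to the two-octet case for illustration. */
long utf8_decode2(unsigned char b0, unsigned char b1)
{
    /* Must look like 110xxxxx 10xxxxxx. */
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;

    long cp = ((long)(b0 & 0x1F) << 6) | (long)(b1 & 0x3F);

    /* Anything below 0x80 fits in one octet, so the two-octet
     * form is an overlong encoding and is rejected. */
    if (cp < 0x80)
        return -1;

    return cp;
}
```

With this check, 0xC0 0x80 carries the payload bits 00000 000000, decodes to
zero, and is rejected as overlong, while 0xC3 0xA9 decodes to U+00E9 as usual.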

However, people have noticed this two-octet encoding of zero and made it a
fairly formal informal standard called "Modified UTF-8". Wikipedia
notes the existence of "Modified UTF-8" implementations in various
places, including Java components, and Tcl internals (remember,
however, that citing Wikipedia is NOT the same as citing a
trustworthy standard).

With Modified UTF-8, you get both an unambiguous two-octet encoding
of the NUL character and the freedom to append a NUL octet as a C
string terminator, which can ease using legacy C interfaces.
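
A minimal sketch of how that might look in C, assuming the input is a
length-counted buffer that may hold embedded zero octets. The function name
mutf8_escape_nuls and its error convention are invented for this example, and
it covers only the NUL handling described above.

```c
#include <stddef.h>

/* Copy len octets from src into dst, writing each zero octet as the
 * pair 0xC0 0x80, then append a terminating NUL.
 * Returns the number of octets written, not counting the terminator,
 * or (size_t)-1 if dst (capacity cap) is too small.
 * Hypothetical helper for illustration only. */
size_t mutf8_escape_nuls(const unsigned char *src, size_t len,
                         char *dst, size_t cap)
{
    unsigned char *out = (unsigned char *)dst;
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        size_t need = (src[i] == 0) ? 2 : 1;

        /* Always leave room for the terminator. */
        if (n + need >= cap)
            return (size_t)-1;

        if (src[i] == 0) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = src[i];
        }
    }

    if (n >= cap)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

The result contains no zero octet before the terminator, so strlen() and
friends see the whole string, embedded NUL characters included.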

-- sur-b.
