On 7-Dec-06, at 3:44 PM, Brian Weed wrote:

> Asko Kauppi wrote:
>> But there may be some identifier "stamp" that can be used to know a file is UTF-8, no?
> There are two that I know of. I don't know how "standard" they are. One is called a BOM Header, which is some binary code in the first 2 bytes of the "text" file. The other is the occurrence of this text "charset=utf-8", anywhere in the file (at least according to the editor I use: UltraEdit).

OK, another delve into the intricacies of Unicode. An "encoding form" is a mapping between sequences of numbers (code units) of some fixed word size and Unicode characters. There are three of these, corresponding to 8-bit, 16-bit, and 32-bit code units: UTF-8, UTF-16 and UTF-32.
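To make that concrete, here is roughly what the three forms look like for a single character, sketched in Lua (assuming Lua 5.3+ for the bitwise operators; the three-byte UTF-8 pattern shown applies only to code points in the range U+0800..U+FFFF):

local cp = 0x20AC   -- EURO SIGN

-- UTF-8: three 8-bit code units for a code point in U+0800..U+FFFF
local u8 = { 0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F) }
-- UTF-16: one 16-bit code unit (U+20AC is in the Basic Multilingual Plane)
local u16 = { cp }
-- UTF-32: always exactly one 32-bit code unit
local u32 = { cp }

print(string.format("UTF-8:  %02X %02X %02X", table.unpack(u8)))  --> E2 82 AC
print(string.format("UTF-16: %04X", u16[1]))                      --> 20AC
print(string.format("UTF-32: %08X", u32[1]))                      --> 000020AC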

However, serializing a sequence of numbers into a sequence of bytes is subject to the vagaries of endianness, so Unicode also defines "encoding schemes", each of which specifies how a string in some encoding form is serialized into bytes. There are seven character encoding schemes defined by Unicode (and several others in less common use): UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.

BE and LE refer to endianness; if a string is advertised as being in UTF-16BE, for example, the 16-bit numbers are unambiguously serialized in big-endian (i.e. network order).
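For instance, the same two 16-bit code units serialize to different byte sequences under the two explicit schemes. A sketch, assuming Lua 5.3+ (whose string.pack accepts explicit endianness modifiers):

local units = { 0x0041, 0x20AC }   -- 'A' and the EURO SIGN, as UTF-16 code units

local be = string.pack(">I2I2", units[1], units[2])   -- UTF-16BE: 00 41 20 AC
local le = string.pack("<I2I2", units[1], units[2])   -- UTF-16LE: 41 00 AC 20

local function hex(s)
  return (s:gsub(".", function(c) return string.format("%02X ", c:byte()) end))
end
print("UTF-16BE: " .. hex(be))
print("UTF-16LE: " .. hex(le))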

The two encoding schemes with unspecified endianness, UTF-16 and UTF-32, *may* start with 0xFEFF, the so-called Byte Order Mark (BOM). If it does, the BOM reveals the endianness of the encoding, and does not form part of the data stream. (If it doesn't start with a BOM, the data must be serialized in big-endian format.)
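Spelled out for the endianness-unspecified UTF-16 scheme, the rule comes to something like this (just a sketch, with a made-up function name; UTF-32 works the same way with a four-byte BOM):

local function utf16_endianness(bytes)
  local b1, b2 = bytes:byte(1, 2)
  if b1 == 0xFE and b2 == 0xFF then
    return "be", bytes:sub(3)   -- BOM says big-endian; it is not part of the data
  elseif b1 == 0xFF and b2 == 0xFE then
    return "le", bytes:sub(3)   -- BOM says little-endian; it is not part of the data
  else
    return "be", bytes          -- no BOM: big-endian by definition
  end
end

-- e.g. utf16_endianness("\xFE\xFF\x00\x41")  --> "be", "\x00\x41"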

U+FEFF is a valid Unicode character, the rather Zen "Zero Width No-Break Space". If a stream advertised as being UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE starts with U+FEFF, then that character must be passed on to the application, which will presumably ignore it since it's hard to know how else to process such a character. (However, it would form part of a MAC -- message authentication code -- if you were constructing one.)

You cannot tell with absolute rigor whether a stream is UTF-16 or UTF-32 by examining the first bytes, because U+0000 is a legal character; thus, a stream starting 0x00 0x00 0xFE 0xFF could be big-endian 16-bit data representing the characters NUL and ZWNBS, or it could be a UTF-32 BOM. Similarly, a stream starting 0xFF 0xFE 0x00 0x00 could be a little-endian 16-bit BOM followed by a NUL, or a little-endian 32-bit BOM.
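The two ambiguous prefixes, spelled out (the \xNN string escapes need Lua 5.2 or later):

local s = "\x00\x00\xFE\xFF"
-- as UTF-16 (no BOM, hence big-endian): U+0000 then U+FEFF, i.e. NUL, ZWNBS
-- as UTF-32: a big-endian BOM, no characters yet

local t = "\xFF\xFE\x00\x00"
-- as UTF-16: a little-endian BOM followed by U+0000 (NUL)
-- as UTF-32: a little-endian BOM, no characters yet

for _, prefix in ipairs{ s, t } do
  print((prefix:gsub(".", function(c) return string.format("%02X ", c:byte()) end)))
end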

A UTF-8 stream might start with a ZWNBS (a practice the Unicode Consortium "neither requires nor recommends"), but it would be interpreted as a ZWNBS (part of the character stream) and not a BOM. This would be a pretty good indication that the stream was UTF-8 (although it could be the unlikely iso-8859-1 sequence ï»¿).
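So the most an encoding sniffer can honestly do for UTF-8 is something like the following (a sketch; EF BB BF is the UTF-8 encoding of U+FEFF, and finding it at the start is a strong hint, not proof):

local function starts_with_utf8_zwnbs(bytes)
  return bytes:byte(1) == 0xEF
     and bytes:byte(2) == 0xBB
     and bytes:byte(3) == 0xBF
end

-- If this returns true, the file is very probably UTF-8, but the three bytes
-- are still a ZWNBS belonging to the text, not a BOM to be stripped silently.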

All this may seem like sophistry, but it is important if you're doing digital signatures. Unicode assumes that there will be some external indication of how a byte stream is to be encoded, such as a MIME header or an XML declaration.

Far and away the simplest mechanism is to require that character strings used in data exchange be in UTF-8. I'd certainly be quite happy for Lua to insist on UTF-8 as a source file encoding format (rejecting invalid byte sequences); transcoding could be left to utilities like iconv which seem to be pretty universally available.
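For what it's worth, the "reject invalid byte sequences" part is not much code. A rough sketch (assuming Lua 5.3+ for the bitwise operators; nothing here is in the Lua core) that enforces shortest-form encoding, the surrogate gap and the U+10FFFF ceiling:

local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local c = s:byte(i)
    local len, min, cp
    if c < 0x80 then
      i = i + 1                                   -- plain ASCII byte
    else
      if c >= 0xC2 and c <= 0xDF then len, min, cp = 2, 0x80, c & 0x1F
      elseif c >= 0xE0 and c <= 0xEF then len, min, cp = 3, 0x800, c & 0x0F
      elseif c >= 0xF0 and c <= 0xF4 then len, min, cp = 4, 0x10000, c & 0x07
      else return false end                       -- stray continuation or invalid lead byte
      if i + len - 1 > n then return false end    -- truncated sequence
      for j = i + 1, i + len - 1 do
        local b = s:byte(j)
        if b < 0x80 or b > 0xBF then return false end   -- not a continuation byte
        cp = (cp << 6) | (b & 0x3F)
      end
      if cp < min or cp > 0x10FFFF then return false end       -- overlong or out of range
      if cp >= 0xD800 and cp <= 0xDFFF then return false end   -- UTF-16 surrogates
      i = i + len
    end
  end
  return true
end

print(is_valid_utf8("h\xC3\xA9llo"))   --> true
print(is_valid_utf8("\xC0\xAF"))       --> false (overlong encoding of '/')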