On 7-Dec-06, at 3:44 PM, Brian Weed wrote:

> Asko Kauppi wrote:
>> But there may be some identifier "stamp" that can be used to know a file is UTF-8, no?
> There are two that I know of. I don't know how "standard" they are. One is called a BOM Header, which is some binary code in the first 2 bytes of the "text" file. The other is the occurrence of this text "charset=utf-8", anywhere in the file (at least according to the editor I use: UltraEdit).

OK, another delve into the intricacies of Unicode. An "encoding form" is a mapping between sequences of numbers (code units) of some fixed word size and Unicode characters. There are three of these, corresponding to 8-bit, 16-bit, and 32-bit code units: UTF-8, UTF-16 and UTF-32.
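To make that concrete, here is roughly what the three forms look like for a single character, sketched in Lua (assuming Lua 5.3+ for the bitwise operators; the three-byte UTF-8 pattern shown applies only to code points in the range U+0800..U+FFFF):

local cp = 0x20AC   -- EURO SIGN

-- UTF-8: three 8-bit code units for a code point in U+0800..U+FFFF
local u8 = { 0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F) }
-- UTF-16: one 16-bit code unit (U+20AC is in the Basic Multilingual Plane)
local u16 = { cp }
-- UTF-32: always exactly one 32-bit code unit
local u32 = { cp }

print(string.format("UTF-8:  %02X %02X %02X", table.unpack(u8)))  --> E2 82 AC
print(string.format("UTF-16: %04X", u16[1]))                      --> 20AC
print(string.format("UTF-32: %08X", u32[1]))                      --> 000020AC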

However, serializing a sequence of numbers into a sequence of bytes is subject to the vagaries of endianness, so Unicode also defines "encoding schemes", each of which specifies how a string in some encoding form is serialized into bytes. There are seven character encoding schemes defined by Unicode (and several others in less common use): UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.

BE and LE refer to endianness; if a string is advertised as being in UTF-16BE, for example, the 16-bit numbers are unambiguously serialized in big-endian (i.e. network order).
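For instance, the same two 16-bit code units serialize to different byte sequences under the two explicit schemes. A sketch, assuming Lua 5.3+ (whose string.pack accepts explicit endianness modifiers):

local units = { 0x0041, 0x20AC }   -- 'A' and the EURO SIGN, as UTF-16 code units

local be = string.pack(">I2I2", units[1], units[2])   -- UTF-16BE: 00 41 20 AC
local le = string.pack("<I2I2", units[1], units[2])   -- UTF-16LE: 41 00 AC 20

local function hex(s)
  return (s:gsub(".", function(c) return string.format("%02X ", c:byte()) end))
end
print("UTF-16BE: " .. hex(be))
print("UTF-16LE: " .. hex(le))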

The two encoding schemes with unspecified endianness, UTF-16 and UTF-32, *may* start with 0xFEFF, the so-called Byte Order Mark (BOM). If it does, the BOM reveals the endianness of the encoding, and does not form part of the data stream. (If it doesn't start with a BOM, the data must be serialized in big-endian format.)
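Spelled out for the endianness-unspecified UTF-16 scheme, the rule comes to something like this (just a sketch, with a made-up function name; UTF-32 works the same way with a four-byte BOM):

local function utf16_endianness(bytes)
  local b1, b2 = bytes:byte(1, 2)
  if b1 == 0xFE and b2 == 0xFF then
    return "be", bytes:sub(3)   -- BOM says big-endian; it is not part of the data
  elseif b1 == 0xFF and b2 == 0xFE then
    return "le", bytes:sub(3)   -- BOM says little-endian; it is not part of the data
  else
    return "be", bytes          -- no BOM: big-endian by definition
  end
end

-- e.g. utf16_endianness("\xFE\xFF\x00\x41")  --> "be", "\x00\x41"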

U+FEFF is a valid Unicode character, the rather Zen "Zero Width No-Break Space". If a stream advertised as being UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE starts with U+FEFF, then that character must be passed on to the application, which will presumably ignore it since it's hard to know how else to process such a character. (However, it would form part of a MAC -- message authentication code -- if you were constructing one.)

You cannot tell with absolute rigor whether a stream is UTF-16 or UTF-32 by examining the first bytes, because U+0000 is a legal character; thus, a stream starting 0x00 0x00 0xFE 0xFF could be big-endian 16-bit data representing the characters NUL and ZWNBS, or it could be a UTF-32 BOM. Similarly, a stream starting 0xFF 0xFE 0x00 0x00 could be a little-endian 16-bit BOM followed by a NUL, or a little-endian 32-bit BOM.
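The two ambiguous prefixes, spelled out (the \xNN string escapes need Lua 5.2 or later):

local s = "\x00\x00\xFE\xFF"
-- as UTF-16 (no BOM, hence big-endian): U+0000 then U+FEFF, i.e. NUL, ZWNBS
-- as UTF-32: a big-endian BOM, no characters yet

local t = "\xFF\xFE\x00\x00"
-- as UTF-16: a little-endian BOM followed by U+0000 (NUL)
-- as UTF-32: a little-endian BOM, no characters yet

for _, prefix in ipairs{ s, t } do
  print((prefix:gsub(".", function(c) return string.format("%02X ", c:byte()) end)))
end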

A UTF-8 stream might start with a ZWNBS (a practice the Unicode Consortium "neither requires nor recommends"), but it would be interpreted as a ZWNBS (part of the character stream) and not a BOM. This would be a pretty good indication that the stream was UTF-8 (although it could be the unlikely iso-8859-1 sequence ï»¿).
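So the most an encoding sniffer can honestly do for UTF-8 is something like the following (a sketch; EF BB BF is the UTF-8 encoding of U+FEFF, and finding it at the start is a strong hint, not proof):

local function starts_with_utf8_zwnbs(bytes)
  return bytes:byte(1) == 0xEF
     and bytes:byte(2) == 0xBB
     and bytes:byte(3) == 0xBF
end

-- If this returns true, the file is very probably UTF-8, but the three bytes
-- are still a ZWNBS belonging to the text, not a BOM to be stripped silently.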

All this may seem like sophistry, but it is important if you're doing digital signatures. Unicode assumes that there will be some external indication of how a byte stream is to be encoded, such as a MIME header or an XML declaration.

Far and away the simplest mechanism is to require that character strings used in data exchange be in UTF-8. I'd certainly be quite happy for Lua to insist on UTF-8 as a source file encoding format (rejecting invalid byte sequences); transcoding could be left to utilities like iconv which seem to be pretty universally available.
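For what it's worth, the "reject invalid byte sequences" part is not much code. A rough sketch (assuming Lua 5.3+ for the bitwise operators; nothing here is in the Lua core) that enforces shortest-form encoding, the surrogate gap and the U+10FFFF ceiling:

local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local c = s:byte(i)
    local len, min, cp
    if c < 0x80 then
      i = i + 1                                   -- plain ASCII byte
    else
      if c >= 0xC2 and c <= 0xDF then len, min, cp = 2, 0x80, c & 0x1F
      elseif c >= 0xE0 and c <= 0xEF then len, min, cp = 3, 0x800, c & 0x0F
      elseif c >= 0xF0 and c <= 0xF4 then len, min, cp = 4, 0x10000, c & 0x07
      else return false end                       -- stray continuation or invalid lead byte
      if i + len - 1 > n then return false end    -- truncated sequence
      for j = i + 1, i + len - 1 do
        local b = s:byte(j)
        if b < 0x80 or b > 0xBF then return false end   -- not a continuation byte
        cp = (cp << 6) | (b & 0x3F)
      end
      if cp < min or cp > 0x10FFFF then return false end       -- overlong or out of range
      if cp >= 0xD800 and cp <= 0xDFFF then return false end   -- UTF-16 surrogates
      i = i + len
    end
  end
  return true
end

print(is_valid_utf8("h\xC3\xA9llo"))   --> true
print(is_valid_utf8("\xC0\xAF"))       --> false (overlong encoding of '/')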