On 28-Oct-05, at 6:56 PM, David Given wrote:

> (I'm now trying to figure out what spec I was conflating it with that
> consists of a sequence of high-bit characters followed by a non-high-bit
> character --- this is going to bother me all night.)

That would probably be the encoding used by ASN.1 for OID segments.
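
As a rough sketch (Python, with a function name of my own choosing), the OID subidentifier encoding works like this: the value is split into 7-bit groups and the high bit is set on every octet except the last, so a decoder stops at the first octet whose high bit is clear.

    def encode_oid_subidentifier(value):
        # Split the value into 7-bit groups, most significant first.
        groups = [value & 0x7F]
        value >>= 7
        while value:
            groups.insert(0, value & 0x7F)
            value >>= 7
        # Set the high bit on every octet except the final one.
        return bytes(0x80 | g for g in groups[:-1]) + bytes(groups[-1:])

    # The well-known arcs of 1.2.840.113549:
    assert encode_oid_subidentifier(840) == b'\x86\x48'
    assert encode_oid_subidentifier(113549) == b'\x86\xf7\x0d'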

Although that encoding is denser than UTF-8 (that is, has less overhead), it does not satisfy one of the key properties of UTF-8:

P1) No valid code sequence is a substring of another valid code sequence (*).

(*) I'm using "code sequence" to mean the encoding of a single Unicode code point, and "string" to mean the encoding of a sequence of Unicode code points. These may not be standard vocabulary.

This property allows you to search a UTF-8 string for a UTF-8 code sequence without worrying about false hits. In particular, seven-bit characters only appear as themselves, which means that if you're searching for, say, "a", you will only find "a"'s, and not bits of longer code sequences which happen to include the octet for "a".
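
A quick Python sketch of that (the strings are just examples of mine): a blind byte search for the UTF-8 encoding of one code point can only ever hit real occurrences of that code point.

    haystack = "naïve café résumé".encode("utf-8")
    needle = "é".encode("utf-8")            # C3 A9
    # count() knows nothing about UTF-8; it is a plain byte-substring
    # search, yet it can only find the three real é's.
    assert haystack.count(needle) == 3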

This is particularly important because of "metacharacters": filepath and url separators (/:? etc), special shell characters (&;(){}[]| and a host of others), etc. Such characters cannot "sneak into" a UTF-8 sequence, as they can in other wide-character encodings. So a filesystem, for example, does not have to be UTF-8 aware in order to avoid false recognitions of /.
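
For example (a Python sketch; I picked U+012F simply because its UTF-16 encoding happens to contain a 2F octet):

    # In UTF-8, every octet of a multi-byte sequence has its high bit set,
    # so the / octet (2F) can only appear where a real U+002F was encoded.
    assert b"/" not in "į".encode("utf-8")      # U+012F -> C4 AF
    # In UTF-16LE the very same character contains a raw 2F octet:
    assert b"/" in "į".encode("utf-16-le")      # U+012F -> 2F 01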

Care must be taken when using compatibility normalization forms, however. Compatibility normalization of a Unicode string can create "accidental" metacharacters; for example, the compatibility normalization of ½ (one-half) is 1/2, which creates a filepath/url separator more or less out of thin air. The stringprep/punycode community seem to believe that compatibility normalization "solves" the "unicode spoofing" security issue, but I'm not convinced; it seems to me that it trades one problem for another.
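
To make that concrete (a Python sketch; I've picked the fullwidth solidus U+FF0F rather than ½, but it illustrates the same kind of accident): compatibility normalization turns a character that is not a path separator into a plain /.

    import unicodedata

    name = "etc\uFF0Fpasswd"                      # FULLWIDTH SOLIDUS, not a separator
    assert "/" not in name
    folded = unicodedata.normalize("NFKC", name)  # compatibility normalization
    assert folded == "etc/passwd"                 # an ordinary U+002F out of thin air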

The other two important properties of UTF-8:

P2) There is a one-to-one correspondence between codes and code sequences

In the initial definition of UTF-8, this was not clear, but it has been part of the specification for quite a while now. Consequently, you cannot represent / (U+002F) as anything other than the single octet 2F. (C0 AF, E0 80 AF and F0 80 80 AF are all invalid code sequences).
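
A strict decoder is required to reject the overlong forms; in Python terms (just as a sketch):

    # Only the single octet 2F encodes U+002F.
    assert "/".encode("utf-8") == b"\x2f"

    # The overlong candidates are not valid UTF-8 at all.
    for overlong in (b"\xc0\xaf", b"\xe0\x80\xaf", b"\xf0\x80\x80\xaf"):
        try:
            overlong.decode("utf-8")
        except UnicodeDecodeError:
            pass                                   # rejected, as it should be
        else:
            raise AssertionError("overlong sequence was accepted")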

As I understand it, Java deliberately violated this property in order to represent NUL (U+0000) characters without signalling an inadvertent end-of-string to C libraries. (That is, it encodes U+0000 as C0 80, which is not a valid UTF-8 code sequence.)

The ASN.1 OID encoding also has this property, which is important for fast key lookups.

P3) Code sequences compare lexicographically in the same order as codes compare numerically.

This was one of the original design goals for UTF-8, and it would be more interesting if it were shared by UTF-16. Unfortunately, it isn't, since the first decimosextet of a "surrogate pair" (i.e. the UTF-16 representation of a Unicode value >= U+10000) is in the range 0xD800-0xDBFF, so in UTF-16 order such characters sort below the code points U+E000 through U+FFFF. This could cause data correlation problems if you're storing key data in naively sorted databases, some in UTF-16 and others in UTF-8. But in general, it seems like a non-issue either way. In any event, numeric order of Unicode values is not useful for presentation purposes, so the fact that UTF-8 mimics it does not seem to be of much use.
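
A small Python sketch of the difference (the two characters are arbitrary picks of mine, one in the BMP above the surrogate range and one above U+FFFF):

    a, b = "\uFFFD", "\U00010400"                  # U+FFFD < U+10400 as code points
    assert ord(a) < ord(b)

    # UTF-8 strings sort the same way as the code points (P3)...
    assert a.encode("utf-8") < b.encode("utf-8")   # EF BF BD < F0 90 90 80

    # ...but UTF-16 does not: the leading surrogate D801 sorts below FFFD,
    # so the comparison flips.
    assert a.encode("utf-16-be") > b.encode("utf-16-be")  # FF FD > D8 01 DC 00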