- Subject: Unicode musings. Was Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Sat, 29 Oct 2005 21:42:42 -0500
On 28-Oct-05, at 6:56 PM, David Given wrote:
> (I'm now trying to figure out what spec I was conflating it with that
> consists of a sequence of high-bit characters followed by a non-high-bit
> character --- this is going to bother me all night.)
That would probably be the encoding used by ASN.1 for OID segments.
Although that encoding is denser than UTF-8 (that is, has less
overhead), it does not satisfy one of the key properties of UTF-8:
P1) No valid code sequence is a substring of another valid code
sequence (*).
(*) I'm using "code sequence" to mean the encoding of a single Unicode
code point, and "string" to mean the encoding of a sequence of Unicode
code points. These may not be standard vocabulary.
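For concreteness, here is a minimal sketch (not a real ASN.1 implementation; oid_encode is just an illustrative name) of the base-128 scheme ASN.1 uses for OID sub-identifiers, and of how it breaks P1:

    -- Base-128 encoding of a single OID sub-identifier: every octet but
    -- the last has its high bit set.
    local function oid_encode(n)
      local s = string.char(n % 128)           -- final octet: high bit clear
      n = math.floor(n / 128)
      while n > 0 do
        s = string.char(128 + n % 128) .. s    -- leading octets: high bit set
        n = math.floor(n / 128)
      end
      return s
    end

    -- P1 fails: the encoding of 1 is a (suffix) substring of the encoding
    -- of 129, so a naive byte search for one encoded value can hit inside
    -- another.
    assert(oid_encode(1)   == "\1")
    assert(oid_encode(129) == "\129\1")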
This property allows you to search a UTF-8 string for a UTF-8 code
sequence without worrying about false hits. In particular, seven-bit
characters only appear as themselves, which means that if you're
searching for, say, "a", you will only find "a"'s, and not bits of
longer code sequences which happen to include the octet for "a".
This is particularly important because of "metacharacters": filepath
and url separators (/:? etc), special shell characters (&;(){}[]| and a
host of others), etc. Such characters cannot "sneak into" a UTF-8
sequence, as they can in other wide-character encodings. So a
filesystem, for example, does not have to be UTF-8 aware in order to
avoid false recognitions of /.
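A minimal sketch of what that buys you (the sample string is purely illustrative): a plain, byte-oriented search for a seven-bit character in a UTF-8 string cannot produce a false hit, because every octet of a multi-byte sequence is >= 0x80.

    -- "é" is C3 A9 and "½" is C2 BD in UTF-8; none of those octets can
    -- collide with a seven-bit character such as "/" or "a".
    local s = "caf\195\169/page\194\189"       -- "café/page½" spelled in octets

    print(string.find(s, "/", 1, true))        --> 6   6  (the real separator)
    print(string.find(s, "a", 1, true))        --> 2   2

In UTF-16, by contrast, the octet 2F does turn up inside the encodings of other characters (U+012F is 01 2F in UTF-16BE), so a byte-level search there has to be encoding-aware.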
Care must be taken when using compatibility normalization forms,
however. Compatibility normalization of a Unicode string can create
"accidental" metacharacters; for example, the compatibility
normalization of ½ (one-half) is 1/2, which creates a filepath/url
separator more or less out of thin air. The stringprep/punycode
community seem to believe that compatibility normalization "solves" the
"unicode spoofing" security issue, but I'm not convinced; it seems to
me that it trades one problem for another.
The other two important properties of UTF-8:
P2) There is a one-to-one correspondence between codes and code
sequences
In the initial definition of UTF-8, this was not clear, but it has been
part of the specification for quite a while now. Consequently, you
cannot represent / (U+002F) as anything other than the single octet 2F.
(The overlong forms C0 AF, E0 80 AF and F0 80 80 AF are all invalid code
sequences.)
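A minimal sketch of a shortest-form encoder makes the one-to-one property concrete (utf8_encode is an illustrative name, not a library function); a conforming decoder must reject anything longer than what this produces:

    -- The unique, shortest-form UTF-8 encoding of one code point.
    local function utf8_encode(c)
      if c < 0x80 then
        return string.char(c)
      elseif c < 0x800 then
        return string.char(0xC0 + math.floor(c / 0x40),
                           0x80 + c % 0x40)
      elseif c < 0x10000 then
        return string.char(0xE0 + math.floor(c / 0x1000),
                           0x80 + math.floor(c / 0x40) % 0x40,
                           0x80 + c % 0x40)
      else
        return string.char(0xF0 + math.floor(c / 0x40000),
                           0x80 + math.floor(c / 0x1000) % 0x40,
                           0x80 + math.floor(c / 0x40) % 0x40,
                           0x80 + c % 0x40)
      end
    end

    assert(utf8_encode(0x2F) == "/")           -- the only encoding of U+002F
    assert(utf8_encode(0xBD) == "\194\189")    -- ½ is C2 BD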
As I understand it, Java deliberately violated this property in order
to represent NUL (U+0000) characters without signalling an inadvertent
end-of-string to C libraries. (That is, it encodes U+0000 as C0 80,
which is not a valid UTF-8 code sequence.)
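A two-line sketch of that trick (purely illustrative; this is Java's "modified UTF-8", not something a conforming encoder should emit):

    -- Encode U+0000 as the overlong pair C0 80, so the encoded string
    -- contains no zero octet and survives NUL-terminated C APIs.
    local modified_nul = string.char(0xC0, 0x80)    -- "\192\128"
    assert(not string.find(modified_nul, "\0", 1, true))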
The ASN.1 OID encoding also has this property, which is important
for fast key lookups.
P3) Code sequences compare lexicographically in the same order as codes
compare numerically.
This was one of the original design goals for UTF-8, and would be more
interesting if it were shared by UTF-16. Unfortunately, it isn't, since
the first 16-bit unit of a "surrogate pair" (i.e. the UTF-16
representation of a Unicode value >= U+10000) is in the range
0xD800-0xDBFF. This could cause data correlation problems if you're
storing key data in naively sorted databases, some in UTF-16 and others
in UTF-8. But in general, it seems like a non-issue either way. In any
event, numeric order of Unicode values is not useful for presentation
purposes, so the fact that UTF-8 mimics it does not seem to be of much
use.
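To make the contrast concrete, here is a small sketch (utf16be_encode is just an illustrative helper) showing that byte-wise comparison of UTF-8 strings tracks code-point order where big-endian UTF-16 does not:

    -- Big-endian UTF-16 encoding of one code point, surrogates included.
    local function utf16be_encode(c)
      if c < 0x10000 then
        return string.char(math.floor(c / 0x100), c % 0x100)
      end
      c = c - 0x10000
      local hi = 0xD800 + math.floor(c / 0x400)     -- high surrogate
      local lo = 0xDC00 + c % 0x400                 -- low surrogate
      return string.char(math.floor(hi / 0x100), hi % 0x100,
                         math.floor(lo / 0x100), lo % 0x100)
    end

    -- U+E000 < U+10000 numerically.  In UTF-8 (EE 80 80 vs F0 90 80 80)
    -- the byte comparison agrees; in UTF-16BE (E0 00 vs D8 00 DC 00) it
    -- does not, because surrogates sort below E0xx.
    print("\238\128\128" < "\240\144\128\128")                --> true
    print(utf16be_encode(0xE000) < utf16be_encode(0x10000))   --> false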