On 28-Oct-05, at 6:56 PM, David Given wrote:

> (I'm now trying to figure out what spec I was conflating it with that
> consists of a sequence of high-bit characters followed by a non-high-bit
> character --- this is going to bother me all night.)

That would probably be the encoding used by ASN.1 for OID segments.
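
As a rough sketch (Python, with a function name of my own choosing), the OID subidentifier encoding works like this: the value is split into 7-bit groups and the high bit is set on every octet except the last, so a decoder stops at the first octet whose high bit is clear.

    def encode_oid_subidentifier(value):
        # Split the value into 7-bit groups, most significant first.
        groups = [value & 0x7F]
        value >>= 7
        while value:
            groups.insert(0, value & 0x7F)
            value >>= 7
        # Set the high bit on every octet except the final one.
        return bytes(0x80 | g for g in groups[:-1]) + bytes(groups[-1:])

    # The well-known arcs of 1.2.840.113549:
    assert encode_oid_subidentifier(840) == b'\x86\x48'
    assert encode_oid_subidentifier(113549) == b'\x86\xf7\x0d'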

Although that encoding is denser than UTF-8 (that is, has less overhead), it does not satisfy one of the key properties of UTF-8:

P1) No valid code sequence is a substring of another valid code sequence (*).

(*) I'm using "code sequence" to mean the encoding of a single Unicode code point, and "string" to mean the encoding of a sequence of Unicode code points. These may not be standard vocabulary.

This property allows you to search a UTF-8 string for a UTF-8 code sequence without worrying about false hits. In particular, seven-bit characters only appear as themselves, which means that if you're searching for, say, "a", you will only find "a"'s, and not bits of longer code sequences which happen to include the octet for "a".
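
A quick Python sketch of that (the strings are just examples of mine): a blind byte search for the UTF-8 encoding of one code point can only ever hit real occurrences of that code point.

    haystack = "naïve café résumé".encode("utf-8")
    needle = "é".encode("utf-8")            # C3 A9
    # count() knows nothing about UTF-8; it is a plain byte-substring
    # search, yet it can only find the three real é's.
    assert haystack.count(needle) == 3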

This is particularly important because of "metacharacters": filepath and url separators (/:? etc), special shell characters (&;(){}[]| and a host of others), etc. Such characters cannot "sneak into" a UTF-8 sequence, as they can in other wide-character encodings. So a filesystem, for example, does not have to be UTF-8 aware in order to avoid false recognitions of /.
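
For example (a Python sketch; I picked U+012F simply because its UTF-16 encoding happens to contain a 2F octet):

    # In UTF-8, every octet of a multi-byte sequence has its high bit set,
    # so the / octet (2F) can only appear where a real U+002F was encoded.
    assert b"/" not in "į".encode("utf-8")      # U+012F -> C4 AF
    # In UTF-16LE the very same character contains a raw 2F octet:
    assert b"/" in "į".encode("utf-16-le")      # U+012F -> 2F 01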

Care must be taken when using compatibility normalization forms, however. Compatibility normalization of a Unicode string can create "accidental" metacharacters; for example, the compatibility normalization of ½ (one-half) is 1/2, which creates a filepath/url separator more or less out of thin air. The stringprep/punycode community seem to believe that compatibility normalization "solves" the "unicode spoofing" security issue, but I'm not convinced; it seems to me that it trades one problem for another.
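
To make that concrete (a Python sketch; I've picked the fullwidth solidus U+FF0F rather than ½, but it illustrates the same kind of accident): compatibility normalization turns a character that is not a path separator into a plain /.

    import unicodedata

    name = "etc\uFF0Fpasswd"                      # FULLWIDTH SOLIDUS, not a separator
    assert "/" not in name
    folded = unicodedata.normalize("NFKC", name)  # compatibility normalization
    assert folded == "etc/passwd"                 # an ordinary U+002F out of thin air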

The other two important properties of UTF-8:

P2) There is a one-to-one correspondence between codes and code sequences

In the initial definition of UTF-8, this was not clear, but it has been part of the specification for quite a while now. Consequently, you cannot represent / (U+002F) as anything other than the single octet 2F. (C0 AF, E0 80 AF and F0 80 80 AF are all invalid code sequences).
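
A strict decoder is required to reject the overlong forms; in Python terms (just as a sketch):

    # Only the single octet 2F encodes U+002F.
    assert "/".encode("utf-8") == b"\x2f"

    # The overlong candidates are not valid UTF-8 at all.
    for overlong in (b"\xc0\xaf", b"\xe0\x80\xaf", b"\xf0\x80\x80\xaf"):
        try:
            overlong.decode("utf-8")
        except UnicodeDecodeError:
            pass                                   # rejected, as it should be
        else:
            raise AssertionError("overlong sequence was accepted")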

As I understand it, Java deliberately violated this property in order to represent NUL (U+0000) characters without signalling an inadvertent end-of-string to C libraries. (That is, it encodes U+0000 as C0 80, which is not a valid UTF-8 code sequence.)

The ASN.1 OID encoding also has this property, which is important for fast key lookups.

P3) Code sequences compare lexicographically in the same order as codes compare numerically.

This was one of the original design goals for UTF-8, and it would be more interesting if it were shared by UTF-16. Unfortunately, it isn't, since the first decimosextet of a "surrogate pair" (i.e. the UTF-16 representation of a Unicode value >= U+10000) is in the range 0xD800-0xDBFF, so in UTF-16 order such characters sort below the code points U+E000 through U+FFFF. This could cause data correlation problems if you're storing key data in naively sorted databases, some in UTF-16 and others in UTF-8. But in general, it seems like a non-issue either way. In any event, numeric order of Unicode values is not useful for presentation purposes, so the fact that UTF-8 mimics it does not seem to be of much use.
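
A small Python sketch of the difference (the two characters are arbitrary picks of mine, one in the BMP above the surrogate range and one above U+FFFF):

    a, b = "\uFFFD", "\U00010400"                  # U+FFFD < U+10400 as code points
    assert ord(a) < ord(b)

    # UTF-8 strings sort the same way as the code points (P3)...
    assert a.encode("utf-8") < b.encode("utf-8")   # EF BF BD < F0 90 90 80

    # ...but UTF-16 does not: the leading surrogate D801 sorts below FFFD,
    # so the comparison flips.
    assert a.encode("utf-16-be") > b.encode("utf-16-be")  # FF FD > D8 01 DC 00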