(I second Tuomo's rant:
 > <ot-rant>They should just have sticked to mapping basic glyphs to numbers
 > and leave the rest to higher-level formats.
 > </ot-rant>
)

> But two identical utf-8 characters can have different encoding, right?
> So two strings can contain the same characters but different byte
> sequences and hence not be equal.

Yes and no. There is no such thing as a UTF-8 character; UTF-8 is an
encoding. Furthermore, it is a unique encoding: every Unicode unit
has a single conformant representation in UTF-8. (I say "unit" instead
of "character" because the latter is ambiguous if not downright
misleading [Note 1], but you will see "character" in all the Unicode
standards.)

The algorithm which guarantees this is simple but subtle. The Unicode
code range is a discontiguous subset of [0, 17*2^16) where some code
points are assigned, some are unassigned, some are illegal, and some
are "surrogates" which could be present in UTF-16 (they are two-code
sequences for single Unicode units whose numbers exceed 16 bits), but
may not be present in UTF-8 or UTF-32.

So a naïve UTF-8 decoder might generate the same Unicode unit from
two different bitstreams, but that would be an error and would make
the UTF-8 decoder non-conformant; in addition, it might generate
illegal code points within the code range or values outside of the
code range, which are also illegal (and non-conformant).
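
To make the conformance point concrete, here is a minimal sketch in Python (my choice of language; nothing in this thread depends on it). It relies only on the built-in strict "utf-8" codec; the helper name decodes_as_utf8 and the constant names are made up for illustration. The byte strings are an overlong form of "/", an encoded surrogate, and a sequence that would decode to a value past the end of the code range.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical sample inputs, named only for this example.
SHORTEST_SLASH = b"\x2f"              # U+002F '/', the single conformant form
OVERLONG_SLASH = b"\xc0\xaf"          # two-byte "overlong" spelling of U+002F
ENCODED_SURROGATE = b"\xed\xa0\x80"   # U+D800, a surrogate, forced into UTF-8
BEYOND_RANGE = b"\xf4\x90\x80\x80"    # would decode to 0x110000, outside the range

for sample in (SHORTEST_SLASH, OVERLONG_SLASH, ENCODED_SURROGATE, BEYOND_RANGE):
    print(sample, decodes_as_utf8(sample))
```

Only the first sample should be accepted; a decoder that accepted the others would be the naïve, non-conformant kind described above.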

On top of this is the question of what character string equality means.
Is "llama" equal to "Llama"? Is it equal to "LLama"? Is it equal to
"LLAMA"? How about "lLama"? (Answer: it depends on your locale, and
the answer might surprise you if your locale is "Spanish traditional
sort"; in the latter case, Microsoft SQL Server chokes on field names
like "MailLogFile")

<rant on-topic="vaguely">

Unfortunately, a Unicode unit might be more or less than a character,
depending on what you think of as a character, and there are strings
whose external visualisation is likely to be pixel-for-pixel identical
which can be expressed in more than one way. There are Unicode units
which have no visible representation whatsoever; some of these, like the
bidi (bidirectionality) units, affect the visual representation of
other units; at least one (U+FEFF, the "zero-width no-break space")
does almost nothing. By convention, it can be used to signal byte order
at the beginning of UTF-16 code sequences, but I don't know of any rule
which says that it cannot be present in a Unicode sequence, where it
does not occupy space, does not signify a word break, and does not
inhibit ligature. (There are a variety of zero-width spaces, with
different characteristics, as well as the ontologically interesting
U+2063 -- "Invisible separator" aka "invisible comma".)
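
A small sketch of what that means for string comparison, again in Python and using only the standard unicodedata module. The variable names and the helper is_format_unit are invented for the example, and "likely renders identically" is an assumption about your display, not something the code checks.

```python
import unicodedata

ZWNBSP = "\ufeff"           # ZERO WIDTH NO-BREAK SPACE, also the byte-order mark
INVISIBLE_COMMA = "\u2063"  # INVISIBLE SEPARATOR

plain = "llama"
padded = "lla" + INVISIBLE_COMMA + "ma"  # likely renders exactly like "llama"

def is_format_unit(ch: str) -> bool:
    """True if ch is in general category Cf (format), i.e. has no glyph of its own."""
    return unicodedata.category(ch) == "Cf"

# The same unit serialised in both UTF-16 byte orders; a reader can tell
# the orders apart because the byte-swapped value is not a valid character.
bom_be = ZWNBSP.encode("utf-16-be")
bom_le = ZWNBSP.encode("utf-16-le")

print(plain == padded, len(plain), len(padded))
print(unicodedata.name(INVISIBLE_COMMA), is_format_unit(INVISIBLE_COMMA))
print(bom_be, bom_le)
```

The two strings compare unequal and have different lengths even though nothing visible distinguishes them.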

Unicode combining characters allow you to build up glyphs (visible
representations) out of bits and pieces; the process is not simply
placing ´ on top of a to produce á, but rather the selection of a
completely different glyph, possibly computed algorithmically. In
theory, you could pile an arbitrarily large number of ´ marks on
top of the same overburdened a, but in some representations they
might go beside each other, on top of each other, or float into
the hyperspace of bad design.
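
For the curious, a sketch of the overburdened a in Python, using unicodedata.normalize from the standard library. The names ACUTE, overburdened and composed are made up here. Composition (the NFC form) is my addition for illustration; it shows that only the first accent has a precomposed code point to merge into, and the rest stay behind as separate units for the renderer to place however it sees fit.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# One base letter carrying three combining accents: four units, one "character".
overburdened = "a" + ACUTE * 3

# NFC merges the base with the first accent into U+00E1; there is no
# precomposed code point for "a with two (or three) acutes".
composed = unicodedata.normalize("NFC", overburdened)

print(len(overburdened), len(composed))
print([f"U+{ord(ch):04X}" for ch in composed])
```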

In a typical liberal excess of legacy support, certain glyphs have
code points as well as being composable; to keep other interests
happy in the initial design phases, separate codes were assigned to
visually identical symbols which carry semantic information in some
contexts. So, for example, you can write Å as:
  U+212B  Angstrom Sign [Note 2]
  U+00C5  Latin Capital Letter A With Ring Above
or
  U+0041 U+030A  Latin Capital Letter A;
                 Combining Ring Above [Note 3]

[Note 2]: An ångström is 10^-10 meters, or 0.1 nanometers, and is about
          the width of an atom; it was named after the Swedish physicist
          Anders Jonas Ångström. The first letter of his surname is U+00C5,
          not U+212B, so using U+212B as an abbreviation is ridiculous.

[Note 3]: Note that U+030A is not the same as U+02DA (Ring Above) although
          the latter could be composed using U+0020 U+030A (Space,
          Combining Ring Above), which is guaranteed to produce the same glyph.
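
A sketch of the three spellings of Å in Python, using the standard unicodedata module. Unicode normalization is the usual remedy here; I am bringing it in as an illustration, and the constant names are hypothetical. Compared unit-by-unit the three strings are all different, but the canonical normalization forms (NFC, NFD) map them to a single spelling. The free-standing ring of Note 3 is only a compatibility equivalent of Space plus Combining Ring Above, so it takes the K forms to unfold it.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"  # ANGSTROM SIGN
A_WITH_RING = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_RING = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

spellings = [ANGSTROM_SIGN, A_WITH_RING, A_PLUS_RING]

# Each canonical form collapses the three spellings into one.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

RING_ABOVE = "\u02da"     # RING ABOVE, the spacing (non-combining) ring
ring_nfc = unicodedata.normalize("NFC", RING_ABOVE)    # canonical: unchanged
ring_nfkd = unicodedata.normalize("NFKD", RING_ABOVE)  # compatibility: space + ring

print(len(set(spellings)), nfc_forms, nfd_forms)
print([f"U+{ord(ch):04X}" for ch in ring_nfkd])
```

So whether two of these strings are "equal" depends on whether you compare raw units or compare after normalizing, and on which form you normalize to.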

(It gets worse, too.)
</rant>

Rici.