- Subject: OT: (of Lua) Re: Unicode?
- From: RLake@...
- Date: Thu, 12 Jun 2003 18:18:43 -0500
> This is true, but it is hard to see how it could be otherwise, or indeed
> why this is bad in the general case. For example, U+0041 (LATIN CAPITAL
> LETTER A) and U+0410 (CYRILLIC CAPITAL LETTER A) have the same glyph,
> but different encodings.
Well, let's take the example from my last e-mail:
Suppose I have three strings: "Ångstrom", "Ångstrom", and "Ångstrom".
(The first one starts with U+00C5, the second with U+212B, and the third
with U+0041 U+030A, in case that didn't show up on your mail client.) Are
these strings equal? I would have liked to have said that they look
identical, but they don't, at least on this machine, with this mail client
and this font (Windows NT / Lotus Notes / Lucida Sans Unicode 10 pt, as it
happens), where they look slightly different. (I think that is a bug in
the font -- from left to right, the ring is responding to the gravity of
the situation by slowly rolling off the A.) If anyone other than me can
see that at all, then I suppose Unicode might be getting somewhere. :)
Well, OK, that is a bit of a cheat because I think they actually turn into
the same string if you apply any Unicode Normalisation transformation. But
what about Cyrillic? (Or Greek, for that matter.) Do the identifiers "A",
"А", and "Α" refer to the same object or not? (That was U+0041, U+0410 and
U+0391, respectively.) What is the general case in which this is not a Bad
Thing? If you are referring to display of text, I would say that was a
pretty specific case. (Furthermore, there is nothing stopping me from
composing U+0410 with U+030A: "А̊", which will not normalise to "Å")
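
For anyone who wants to check these claims, here is a minimal sketch in Python using the standard unicodedata module. Python is only an illustration choice and the variable names are made up for the example. It compares the three spellings of the first letter before and after normalisation, then does the same for the Latin, Cyrillic and Greek look-alikes.

```python
import unicodedata

# The three spellings of the first letter, written as escapes so that
# mail clients and editors cannot silently change them.
precomposed = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom_sign = "\u212B"     # ANGSTROM SIGN
combining = "\u0041\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
spellings = (precomposed, angstrom_sign, combining)

# Compared code point by code point, the three are not all equal.
raw_equal = precomposed == angstrom_sign == combining

# After canonical normalisation all three collapse to a single form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# The look-alike capitals stay distinct, even under the more aggressive
# compatibility normalisation (NFKC).
latin_a, cyrillic_a, greek_a = "\u0041", "\u0410", "\u0391"
lookalikes = {unicodedata.normalize("NFKC", s)
              for s in (latin_a, cyrillic_a, greek_a)}

# CYRILLIC CAPITAL LETTER A + COMBINING RING ABOVE has no precomposed
# form, so NFC leaves it as two code points.
cyrillic_ring = unicodedata.normalize("NFC", "\u0410\u030A")

print(raw_equal)            # False
print(len(nfc_forms))       # 1
print(len(lookalikes))      # 3
print(len(cyrillic_ring))   # 2
```

So normalisation settles the first question but does nothing for the second: an identifier comparison based on normalised code points still treats the three A's as three different names.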
It could have been otherwise with a simple rule: 1 glyph == 1 code. It's
hard to see why you have to compose some but not all diacritics in
European languages, while more than eleven thousand codes are given over
to compositions of Hangul.
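
To put a number on the Hangul remark: the precomposed syllable block runs from U+AC00 to U+D7A3, and every syllable in it can be derived arithmetically from its component jamo. The sketch below is in Python, chosen only for illustration, with a hypothetical helper name decompose_hangul. It follows the arithmetic decomposition described in the Unicode Standard and cross-checks it against unicodedata.

```python
import unicodedata

# Constants for the arithmetic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28    # leading, vowel, trailing (incl. none)
N_COUNT = V_COUNT * T_COUNT               # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT               # total precomposed syllables

def decompose_hangul(ch):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = chr(L_BASE + index // N_COUNT)
    vowel = chr(V_BASE + (index % N_COUNT) // T_COUNT)
    trail_index = index % T_COUNT
    # Trailing index 0 means the syllable has no final consonant.
    trail = chr(T_BASE + trail_index) if trail_index else ""
    return lead + vowel + trail

print(S_COUNT)                                   # 11172
print(decompose_hangul("\uD55C") ==
      unicodedata.normalize("NFD", "\uD55C"))    # True
```

In other words, all 11172 of those code points are redundant in the same sense that U+00C5 is redundant with U+0041 U+030A: each one is canonically equivalent to a short sequence of jamo.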