> This is true, but it is hard to see how it could be otherwise, or indeed
> why this is bad in the general case.  For example, U+0041 (LATIN CAPITAL
> LETTER A) and U+0410 (CYRILLIC CAPITAL LETTER A) have the same glyph,
> but different encodings.

Well, let's take the example from my last e-mail:

Suppose I have three strings: "Ångstrom", "Ångstrom", and "Ångstrom". 
(The first one starts with U+00C5, the second with U+212B, and the third 
with U+0041 U+030A, in case that didn't show up on your mail client.) Are 
these strings equal? I would have liked to have said that they look 
identical, but they don't, at least on this machine, with this mail client 
and this font (Windows NT / Lotus Notes / Lucida Sans Unicode 10 pt, as it 
happens), where they look slightly different. (I think that is a bug in 
the font -- from left to right, the ring is responding to the gravity of 
the situation by slowly rolling off the A.) If anyone other than me can 
see that at all, then I suppose Unicode might be getting somewhere. :)

Well, OK, that is a bit of a cheat because I think they actually turn into 
the same string if you apply any Unicode Normalisation transformation. But 
what about Cyrillic? (Or Greek, for that matter.) Do the identifiers "A", 
"А", and "Α" refer to the same object or not? (That was U+0041, U+0410 and 
U+0391, respectively.) What is the general case in which this is not a Bad 
Thing? If you are referring to display of text, I would say that was a 
pretty specific case. (Furthermore, there is nothing stopping me from 
composing U+0410 with U+030A: "А̊", which will not normalise to "Å")
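
For anyone who wants to check the claims above, here is a minimal sketch using Python's standard `unicodedata` module. Python is my choice for illustration only, and the variable names are made up for this example. It writes the strings with explicit escapes so that nothing depends on how a mail client renders them.

```python
import unicodedata

# The three spellings from the message, written with explicit escapes.
precomposed = "\u00C5ngstrom"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom_sign = "\u212Bngstrom"  # ANGSTROM SIGN
combining = "A\u030Angstrom"     # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
spellings = [precomposed, angstrom_sign, combining]

# Compared code point by code point, the three strings are all different.
raw_distinct = len(set(spellings))

# After canonical normalisation (composed or decomposed) they coincide.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# The Latin, Cyrillic and Greek capital A stay distinct, even under the
# more aggressive compatibility normalisation.
lookalikes = ["\u0041", "\u0410", "\u0391"]
lookalike_forms = {unicodedata.normalize("NFKC", s) for s in lookalikes}

# CYRILLIC CAPITAL LETTER A + COMBINING RING ABOVE has nothing to compose
# into, so it stays as two code points and never becomes U+00C5.
cyrillic_ring = unicodedata.normalize("NFC", "\u0410\u030A")

print(raw_distinct, len(nfc_forms), len(nfd_forms))
print(len(lookalike_forms), len(cyrillic_ring), cyrillic_ring == "\u00C5")
```

So normalisation settles the first question but not the second: the look-alike identifiers remain three different identifiers.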

It could have been otherwise with a simple rule: 1 glyph == 1 code. It's 
hard to see why you have to compose some but not all diacritics in 
European languages, while more than eleven thousand codes are given over 
to compositions of Hangul.
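
The Hangul figure can be checked with a little arithmetic. The sketch below assumes the usual Unicode layout of the precomposed syllables block (19 leading consonants, 21 vowels, and 28 trailing slots counting "no trailing consonant", starting at U+AC00); the constant names are mine, not anything official.

```python
import unicodedata

# Assumed layout of the precomposed Hangul syllables block.
LEADING_COUNT = 19    # initial consonants
VOWEL_COUNT = 21      # medial vowels
TRAILING_COUNT = 28   # final consonants, including "none"

syllable_count = LEADING_COUNT * VOWEL_COUNT * TRAILING_COUNT

FIRST_SYLLABLE = 0xAC00
LAST_SYLLABLE = 0xD7A3
block_size = LAST_SYLLABLE - FIRST_SYLLABLE + 1

# Every one of these is a composition: canonical decomposition turns a
# syllable into its individual jamo, and composition puts it back.
syllable = "\uD55C"  # HANGUL SYLLABLE HAN
jamo = unicodedata.normalize("NFD", syllable)
recomposed = unicodedata.normalize("NFC", jamo)

print(syllable_count, block_size)
print([hex(ord(c)) for c in jamo], recomposed == syllable)
```

In other words, each of those codes is something the combining jamo could already express, which is the asymmetry with the European diacritics that I was complaining about.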