lua-l archive

> > But two identical utf-8 characters can have different
> > encoding, right?
>
> No. I mean, if they have the same unicode number, they must
> have the same utf-8 encoding.

Well, it's worse than that.
In languages such as Hindi and Arabic you have ligatures:
sequences of characters that collapse into a single glyph,
much as some books in Latin script print "fi" or "ff" as one
uninterrupted character.

So, to _really_ support text-processing applications in these
languages, you need to know the ligature composition rules
and tables.

But IMHO this is something best left to the application,
and not attempted at the language level. So string comparisons
of UTF-8 strings _are_ valid string comparisons of the Unicode
strings they represent.
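
To make that concrete, here is a rough sketch (in Python rather than
Lua, only because it has Unicode strings built in; the helper names
are made up for illustration). It assumes well-formed UTF-8 and does
no normalization at all: it checks that equality and ordering of the
encoded bytes agree with equality and ordering of the code points,
and that a precomposed "fi" ligature (U+FB01) is simply a different
string from the two letters "f" and "i".

```python
# Hypothetical helpers for illustration; not part of any Lua API.

LIGATURE_FI = "\ufb01"  # one code point: U+FB01 LATIN SMALL LIGATURE FI
PLAIN_FI = "fi"         # two code points: U+0066 U+0069


def utf8(s):
    """Return the UTF-8 encoding of a Unicode string as bytes."""
    return s.encode("utf-8")


def bytes_equal(a, b):
    """Equality test performed purely on the UTF-8 bytes."""
    return utf8(a) == utf8(b)


def bytes_less(a, b):
    """Ordering test performed purely on the UTF-8 bytes."""
    return utf8(a) < utf8(b)


def sort_by_bytes(words):
    """Sort strings by their UTF-8 bytes instead of by code point."""
    return sorted(words, key=utf8)


if __name__ == "__main__":
    # Same code points give the same bytes, so equality carries over.
    print(bytes_equal("\u00e9", "é"))
    # The ligature and the two plain letters are different strings;
    # treating them as equivalent is the application's job.
    print(bytes_equal(LIGATURE_FI, PLAIN_FI))
    # Byte order and code point order agree.
    sample = ["z", "é", PLAIN_FI, LIGATURE_FI, "\U0001F600", "a"]
    print(sort_by_bytes(sample) == sorted(sample))
```

An application that wants "fi" and the ligature to compare equal
would have to map one onto the other before comparing; the byte
comparison itself cannot know about that.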