A lot of problems arise from overlong encodings, where something like
'a' can be encoded over 3 bytes (if you choose).  Per the standard an
overlong form is an invalid encoding, so it's the developer's job to
reject it or normalize it down to the fewest bytes needed.
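
Here's a minimal sketch of that in plain Lua (no libraries; the decode3
helper is just something I made up for illustration).  It builds the
overlong 3-byte encoding of 'a' by hand, then normalizes it back to the
shortest form by decoding the payload bits and re-encoding:

-- Overlong 3-byte form of U+0061 ('a'): 1110_0000 10_000001 10_100001.
-- Byte-wise it is NOT equal to "a", even though it names the same code point.
local overlong = string.char(0xE0, 0x81, 0xA1)

-- Decode a 3-byte UTF-8 sequence (no validation; illustration only).
local function decode3(s)
  local b1, b2, b3 = s:byte(1, 3)
  return (b1 % 0x10) * 0x1000 + (b2 % 0x40) * 0x40 + (b3 % 0x40)
end

-- Re-encode in shortest form; U+0061 fits in a single byte.
local cp = decode3(overlong)        -- 0x61
local normalized = string.char(cp)  -- "a"
print(#overlong, #normalized, normalized == "a")  --> 3   1   true
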
Other issues arise from the different ways diacritic marks can be
combined to visually represent the same character.  In those cases I
feel like using Optical Character Recognition (OCR) is best: you make
an image of the rendered character and compare it against renderings
of the other encodings that exist, etc...  I'm obviously no expert. :>
You could also just sort the diacritic marks and other crazy things in
Unicode from least to greatest, so there would only be one form of that
character with those marks, etc.
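
For what it's worth, here's a sketch of both ideas, assuming Lua 5.3 or
later for the built-in utf8 library (the sort_marks helper is mine, and
real Unicode normalization orders marks by combining class rather than
raw code point, so treat it as an illustration only):

-- Two byte sequences that render as the same character but compare unequal,
-- which is exactly what normalization is meant to fix.
local precomposed = utf8.char(0x00E9)          -- 'é' as one code point
local decomposed  = utf8.char(0x0065, 0x0301)  -- 'e' + combining acute accent
print(precomposed == decomposed)               --> false, yet both render as 'é'

-- Toy "canonical ordering": sort any run of combining marks (here just the
-- block U+0300..U+036F) by code point, so only one form survives.
local function sort_marks(s)
  local out, marks = {}, {}
  local function flush()
    table.sort(marks)
    for _, m in ipairs(marks) do out[#out + 1] = utf8.char(m) end
    marks = {}
  end
  for _, cp in utf8.codes(s) do
    if cp >= 0x0300 and cp <= 0x036F then
      marks[#marks + 1] = cp
    else
      flush()
      out[#out + 1] = utf8.char(cp)
    end
  end
  flush()
  return table.concat(out)
end

local a = utf8.char(0x61, 0x0301, 0x0327)  -- 'a' + combining acute + cedilla
local b = utf8.char(0x61, 0x0327, 0x0301)  -- same marks, opposite order
print(a == b, sort_marks(a) == sort_marks(b))  --> false   true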

normalize normalize normalize...  fun buzzwords :-)

PS: It gets to be even more fun when an application treats two separate
characters as equivalent, like the ordinary / and the dedicated math
division slash meaning the same thing.
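
A toy version of that kind of application-level folding, again assuming
Lua 5.3+ for utf8.charpattern (the fold_map table is just made up to
illustrate one such equivalence):

-- Map characters the application treats as the same down to one canonical
-- choice before comparing, e.g. U+2215 DIVISION SLASH to plain '/' (U+002F).
local fold_map = {
  [utf8.char(0x2215)] = "/",
}

local function fold(s)
  -- gsub with a table as the replacement looks up each matched character
  -- in fold_map and leaves everything without an entry untouched.
  return (s:gsub(utf8.charpattern, fold_map))
end

print(fold("6" .. utf8.char(0x2215) .. "3") == "6/3")  --> true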