On Sat, Apr 19, 2014 at 6:20 PM, Coroutines <coroutines@gmail.com> wrote:
A lot of problems arise from overlong encodings, where something like
'a' can be encoded over 3 bytes (if you choose).  Per the standard,
that is an invalid encoding, so it's the developers' job to reject
these or normalize them down to the fewest bytes needed.  Other
issues arise from interchanging diacritic marks to visually represent
the same character.  In those cases I feel like using Optical
Character Recognition (OCR) is best.  You make an image of the
rendered character and compare it to the encodings that exist, etc...
I'm obviously no expert. :>  You could also just sort the diacritic
marks and other crazy things in Unicode in least-to-greatest order,
so there would only be 1 form of that character with those marks, etc.
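
To make the overlong case concrete, here is a minimal Python sketch (Python is my choice, nothing in this thread uses it). The helper names raw_code_point, shortest_length, is_overlong and strict_decoder_rejects are made up for illustration. It reads the payload bits of a single UTF-8-shaped sequence without enforcing the shortest-form rule, then compares the sequence length with the minimum needed for that code point. It also checks whether Python's built-in strict decoder refuses the same bytes.

```python
def raw_code_point(seq: bytes) -> int:
    """Return the payload bits of one non-empty UTF-8-shaped sequence.

    Deliberately skips the shortest-form check so overlong input can be inspected.
    """
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif lead >> 5 == 0b110:
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(seq) != length:
        raise ValueError("sequence length does not match lead byte")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp


def shortest_length(cp: int) -> int:
    """Number of bytes the shortest UTF-8 form of `cp` needs."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def is_overlong(seq: bytes) -> bool:
    """True if `seq` uses more bytes than its code point requires."""
    return len(seq) > shortest_length(raw_code_point(seq))


def strict_decoder_rejects(seq: bytes) -> bool:
    """True if Python's own UTF-8 decoder refuses the sequence."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 'a' (U+0061) in its normal form, then padded out to 2 and 3 bytes.
for seq in (b"a", b"\xC1\xA1", b"\xE0\x81\xA1"):
    print(seq, hex(raw_code_point(seq)), is_overlong(seq), strict_decoder_rejects(seq))
```

A decoder that rejects such input outright leaves nothing to normalize afterwards, which is usually the safer design than quietly shortening it.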

normalize normalize normalize...  fun buzzwords :-)
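
The "sort the marks so only one form exists" idea is roughly what Unicode canonical ordering already does: each combining mark carries a numeric combining class, and normalization sorts adjacent marks by it. Below is a small Python sketch using the standard unicodedata module. The names canonical_key and same_character are hypothetical helpers, and the comparison assumes canonical equivalence (NFD) is the notion of "same character" wanted.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def canonical_key(text: str) -> str:
    """Decompose and reorder combining marks into canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)


def same_character(a: str, b: str) -> bool:
    """Compare two strings by canonical equivalence rather than by code points."""
    return canonical_key(a) == canonical_key(b)


# The same visible letter typed with its two marks in either order.
one_order = "a" + ACUTE + DOT_BELOW
other_order = "a" + DOT_BELOW + ACUTE

print(one_order == other_order)                # raw code point comparison
print(same_character(one_order, other_order))  # comparison after normalization
print([unicodedata.combining(ch) for ch in canonical_key(one_order)])
```

This depends on the combining-class table that ships with the Python runtime; whatever provides the Unicode data has to supply the equivalent table.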

PS: It gets to be more fun when in an application 2 separate
characters are treated as equivalent.  Like / for division and the
other math / being the same.
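
For the "two separate characters treated as equivalent" case, one possible approach is an application-level folding step on top of normalization. This Python sketch assumes the application wants U+2215 DIVISION SLASH and U+2044 FRACTION SLASH to count as the ASCII "/". The SLASH_LIKE table and fold_slashes function are made up for illustration and are not part of any standard. Compatibility normalization (NFKC) runs first because it already folds forms such as U+FF0F FULLWIDTH SOLIDUS.

```python
import unicodedata

# Hypothetical application policy: code points to treat as a plain "/".
SLASH_LIKE = {
    0x2215: "/",  # DIVISION SLASH
    0x2044: "/",  # FRACTION SLASH
}


def fold_slashes(text: str) -> str:
    """Apply NFKC, then map the application's slash look-alikes to "/"."""
    return unicodedata.normalize("NFKC", text).translate(SLASH_LIKE)


for sample in ("6/2", "6\u22152", "6\u20442", "6\uFF0F2"):
    print(repr(sample), "->", repr(fold_slashes(sample)))
```

Which characters belong in such a table is a policy decision for the application, so it is kept separate from the normalization call.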


(Straying from the topic of normalization a little...)

It seems like the general consensus is "Lua can't support Unicode, because the lookup tables alone are bigger than all of Lua." But does Lua need to include those tables? Are they not provided by the OS? I suppose C89 probably doesn't include them, but if we're working on a platform that doesn't support C99[1], we're probably working on some small embedded system, where we won't be using Unicode anyway. A Unicode library could easily be made conditional just like os.popen, and just provide an interface to Unicode libraries provided by the OS.

[1] The exception, of course, is Windows. I'm not sure how Windows is dealt with in the Lua world, but I'm imagining three makefiles or targets: one for Windows, one for *nix-likes with Unicode (where you can provide alternate paths to whatever libraries if they're not in a standard place), and one for generic platforms without Unicode.

--
Sent from my Game Boy.