- Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
- From: Sam Roberts <vieuxtech@...>
- Date: Wed, 8 Feb 2012 11:10:09 -0800
I'm slightly baffled as to why this long conversation about unicode
support in lua doesn't seem to acknowledge that the features requested
already exist, AFAICT, see
icu4lua and slnunicode at end of http://lua-users.org/wiki/LuaUnicode
. One or both of those libraries seem to support most (all?) of what
has been identified as "needs" in the multiple times this topic has
been beaten to death on lua-l.
Getting lua's core to change its view of strings to being something
other than a byte-sequence isn't going to happen, it's not the lua way,
and it's caused big problems for languages that have tried it (and not
just code bloat), see http://lwn.net/Articles/478486/ [*]. lua's
approach that strings are binary bytes, and you can decode them using
a 3rd party library into a unicode/other-encoding aware
representation/library seems the right thing to do.
Getting a new library into the lua core is unlikely, but could happen.
bit32 would be the best model - when pretty much everybody was
actually including a bit library in their project, and there was wide
agreement that it was useful, it finally made it into lua. So, if
there was some non-binary string support library that pretty much
everybody used, and found useful, it might make it into lua 5.9, or
something, but in the meantime, if unicode is so critical, and lua's
library doesn't support it - what are people doing? Ignoring it? Well,
then it ain't critical. Using some external library? Well, then it's
also not critical, since support exists. Kind of a catch-22.
[*] LWN's site code is written in Python 2. Version 2.x of the language is
entirely able to handle Unicode, especially for relatively large
values of x. To that end, it has a unicode string type, but this type
is clearly a retrofit. It is not used by default when dealing with
strings; even literal strings must be marked explicitly as Unicode, or
they are just plain strings.
When Unicode was added to Python 2, the developers tried very hard to
make it Just Work. Any sort of mixture between Unicode and "plain
strings" involves an automatic promotion of those strings to Unicode.
It is a nice idea, in that it allows the programmer to avoid thinking
about whether a given string is Unicode or "just a string." But if the
programmer does not know what is in a string - including its encoding
- nobody does. The resulting confusion can lead to corrupted text or
Python exceptions; as Guido van Rossum put it in the introduction to
Python 3, "This value-specific behavior has caused numerous sad faces
over the years." Your editor's experience, involving a few sad faces
for sure, agrees with this; trying to make strings "just work" leads
to code containing booby traps that may not spring until some truly
inopportune time far in the future.
That is why Python 3 changed the rules. There are no "strings" anymore
in the language; instead, one works with either Unicode text or binary
bytes. As a general rule, data coming into a program from a file,
socket, or other source is binary bytes; if the program needs to
operate on that data as text, it must explicitly decode it into
Unicode. This requirement is, frankly, a pain; there is a lot of
explicit encoding and decoding to be done that didn't have to happen
in a Python 2 program. But experience says that it is the only
rational way; otherwise the program (and programmer) never really know
what is in a given string.
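
The bytes/text split described in the excerpt above can be sketched in a few lines of Python 3. This is only an illustration and not code from the LWN article: the sample byte string, the variable names, and the assumption that the input is UTF-8 are all made up for the example.

```python
# Illustrative sketch; the sample data and variable names are hypothetical.
raw = b"caf\xc3\xa9"        # bytes as they might arrive from a file or socket (UTF-8 assumed)

text = raw.decode("utf-8")  # explicit decode: binary bytes -> Unicode text
print(len(raw), len(text))  # byte count and character count differ

# Mixing bytes and text is an error in Python 3, not a silent promotion.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A wrong encoding assumption fails at the decode step, not somewhere later.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

round_trip = text.encode("utf-8")  # explicit encode on the way back out
```

The extra decode and encode calls are the "pain" the excerpt mentions; what they buy is that the program states its encoding assumption at the boundary, which is roughly the same division of labour as keeping lua strings as bytes and decoding them in a library.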