- Subject: Re: Will Lua kernel use Unicode in the future?
- From: Jens Alfke <jens@...>
- Date: Fri, 30 Dec 2005 10:32:05 -0800
(I've replied to a bunch of messages here rather than sending out six separate replies.)
On 29 Dec '05, at 9:17 AM, Chris Marrin wrote:
> It allows you to add "incidental" characters without the need for a
> fully functional editor for that language. For instance, when I
> worked for Sony we had the need to add a few characters of Kanji on
> occasion. It's not easy to get a Kanji editor set up for a western
> keyboard, so adding direct Unicode was more convenient. There are
> also some oddball symbols in the upper registers for math and
> chemistry and such that are easier to add using escapes.
Also, in some projects there are guidelines that discourage the use
of non-ascii characters in source files (due to problems with
editors, source control systems, or other tools). In these situations
it's convenient to be able to use inline escapes to specify non-ascii
characters that commonly occur in human-readable text ... examples
would include ellipses, curly-quotes, em dashes, bullets, currency
symbols, as well as accented letters of course.
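To make the idea concrete, here's what such inline escapes look like in a language that already has them. This sketch is JavaScript, not Lua (Lua has no such escape syntax today); the point is only that the source file stays pure ascii while the runtime strings contain the real characters:

```javascript
// Inline \uXXXX escapes: the source file is pure ascii, but the
// resulting strings contain the actual non-ascii characters.
const ellipsis = "\u2026";    // …
const emDash   = "\u2014";    // —
const euro     = "\u20AC";    // €
const cafe     = "caf\u00E9"; // café — the escape yields one character

console.log(cafe, ellipsis, emDash, euro);
console.log(cafe.length); // 4 — "é" is a single character, not four bytes of escape
```

Any editor, diff tool, or source-control system that chokes on non-ascii bytes handles this file without complaint.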
> IMO, with globalization, languages that don't support Unicode won't
> make the cut in the long run.
I find it ironic that the three non-Unicode-savvy languages I use
(PHP, Ruby, Lua) all come from countries whose native languages use
non-ascii characters :)
I'm keeping an eye on this thread because, if I end up using Lua for
any actual work projects, I18N is mandatory and I can't afford to run
into any walls when it comes time to make things work in Japanese or
Thai or Arabic. (Not like last time... see below for the problems I've run into.)
> I've done an awful lot of work with international character sets,
> and I personally consider any encoding of Unicode other than UTF-8
> to be obsolete. Looking at most other modern (i.e. not held back by
> backwards compatibility, e.g. Windows) users of Unicode, their
> authors appear to feel similarly.
Yes, at least as an external representation. Internally, it can be
convenient to use a wide representation for speed of indexing into
strings, but a good string library should hide that from the client
as an implementation detail. (Disclaimer: I'm mostly knowledgeable
only about Java's and Mac OS X's string libraries.)
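That split — an internal representation hidden behind the string API, UTF-8 only at the boundaries — can be sketched like this (using Node's Buffer purely for illustration; the same shape holds for any string library):

```javascript
// The client of the string library sees characters; UTF-8 bytes
// appear only as the external (wire/file) representation.
const s = "café";                          // 4 characters to the client
const bytes = Buffer.from(s, "utf8");      // external form: 5 bytes, é → 0xC3 0xA9

console.log(s.length, bytes.length);       // 4 5
console.log(bytes.toString("utf8") === s); // true — round-trips losslessly
```

The client code never needs to know whether the library stores wide characters internally; only the encode/decode boundary deals in bytes.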
> Most apps can just treat strings as opaque byte streams.
I agree, mostly; the one area I've run into problems has been with
regexp/pattern libraries. Any pattern that relies on "alphanumeric
characters" or "word boundaries" assumes a fair bit of Unicode
knowledge behind the scenes. I ran into this problem when
implementing a live search field in Safari RSS: while JS strings
are Unicode-savvy, many regexp engines aren't, so character
classes like "\w" or "\b" only match ascii alphanumerics.
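The pitfall is easy to demonstrate in JavaScript, where \w is ascii-only by specification. The Unicode property escapes shown as the fix (\p{L} with the /u flag) arrived much later, in ES2018, so they were not an option at the time of this discussion:

```javascript
// \w is defined as [A-Za-z0-9_] — it never matches accented letters.
console.log(/\w/.test("é"));     // false — ascii-only word class
console.log(/\p{L}/u.test("é")); // true  — Unicode-aware letter class (ES2018)

// Tokenizing text with accented words:
console.log("naïve café".match(/[\p{L}]+/gu)); // ["naïve", "café"]
console.log("naïve café".match(/\w+/g));       // ["na", "ve", "caf"] — words split apart
```

A live-search feature built on \w-style classes silently breaks on exactly the inputs an I18N-aware app cares about.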
> These kinds of problems should be solved at a different level, not
> hacked into Lua. The beautiful thing about Lua is that it's really
> clean ANSI C code...
...except for the parts that aren't, like library loading, and the
extension libraries for sockets, databases, etc. I agree that having
a portable core runtime is important, but there should be some kind
of standard extension for Unicode strings, hopefully one that cleanly
extends the built-in string objects using something like ICU.