On Feb 6, 2012 3:14 AM, "Miles Bader" <miles@gnu.org> wrote:
>
> Jay Carlson <nop@nop.com> writes:
> > I imagine it already has in non-byte locales. A Korean Lua program will not
> > use libraries from a Greek one. Unless everybody is already using
> > UTF-8--with no sanity checking.
>
> Of course everybody's just using UTF-8 with no sanity checking...

Got some Koreans to back you up on that? EUC-KR (realistically, CP949)
still lives; see the <head> of http://chosun.com and donga.com.
naver.com is UTF-8 (not a total surprise). Perhaps all the Korean text
processing *in Lua* is being done in unchecked UTF-8, but I kinda
doubt it. I have no idea what the state of ISO 8859-7 vs UTF-8 is in
Greece (but as you're aware, UTF-8 doubles the size of Greek text.
Good thing there's so much ASCII in HTML.)

Note that Ruby 1.9 decided it wanted to keep SJIS more than it wanted
simplicity--unlike everybody else, it keeps internationalized strings
around as a bag of octets tagged with their encoding. I get tired just
thinking about writing and maintaining that.

> [_real_ "unicode support" is hideously expensive (the size of those
> tables...),

Right. Many systems have already paid the price though. For example,
Linux machines with a UI have glib in core already.

No, it's not free to link it. Using bad methodology: on a Celeron
E1200 running x86_64 Ubuntu 10.04, strace shows that referencing
g_unichar_combining_class() adds 8ms of *startup* latency on top of
bare lua's 4ms. But counting this way, Python 2.6 takes 68ms from
start of dynamic linking to first line of execution, so people who
need it don't have to think twice on that account.

> and it's really kind of hard to say how you could get
> anything like it into Lua without completely ruining its "Luaness"...]

One big lesson from Lua is that you don't need to standardize on
mechanism, you just need some rough consensus on minimum syntactic
interface. Everybody has their own object system but people seem
pretty happy if all they know is o:m() and perhaps Constructor{}
syntax.

I don't think the standard build would necessarily have to do anything
besides ISO-8859-x, but a text.isvalid(s) would still be useful if it
failed on the >= 127 !isprint() characters and the (iscntrl &&
!isspace) characters.
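
A rough sketch of what that byte-oriented check might look like on the C
side, assuming nothing beyond <ctype.h>. The name text_isvalid_bytes is
made up for illustration, and the result depends on the current locale:
in the default "C" locale isprint() is false for every high byte, so the
program would need setlocale() with an ISO-8859-x locale before
high-byte text could pass.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of Lua or any existing library.
   Returns 1 if every one of the n bytes at s looks like text in the
   current single-byte locale, 0 otherwise. */
int text_isvalid_bytes(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* ctype functions need an unsigned char value, not a plain char */
        unsigned char c = (unsigned char)s[i];

        /* control characters are rejected unless they are whitespace
           (tab, newline, and so on) */
        if (iscntrl(c) && !isspace(c))
            return 0;

        /* bytes >= 127 must be printable in the current locale */
        if (c >= 127 && !isprint(c))
            return 0;
    }
    return 1;
}
```

Taking an explicit length rather than relying on a NUL terminator
matches Lua strings, which may contain embedded zeros; an embedded zero
is a non-space control character and so fails the check.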

A replacement could look for well-formed UTF-8. That's easy to write
in standalone C, and at least keeps some kinds of bugs out. Notably it
will blow up if you feed it binary data instead of text.
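
For concreteness, here is one way such a well-formedness check could be
written in standalone C. This is only a sketch: the function name
utf8_wellformed is invented here, and it checks encoding structure only
(no overlong forms, no surrogates, nothing above U+10FFFF), not whether
code points are assigned or normalized.

```c
#include <stddef.h>

/* Hypothetical helper, not an existing API.
   Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise. */
int utf8_wellformed(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (i < n) {
        unsigned char c = p[i];
        size_t len;         /* total bytes in this sequence */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */

        if (c < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (len > n - i)
            return 0;   /* sequence runs past the end of the input */

        for (size_t k = 1; k < len; k++) {
            unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */

        i += len;
    }
    return 1;
}
```

Arbitrary binary data almost always trips one of these tests within a
few bytes, which is the "blow up" behavior described above. The
function takes a pointer and a length so that it can be handed the
result of lua_tolstring() directly.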

If another implementation did pay for those Unicode tables, it could
also check for valid Unicode, or Your Favorite Normal Form, or
whatever. lua.org doesn't have to ship anything complicated and
shouldn't.

If you're wondering why I'm obsessed with the issue of coordination
through core or -llualib, it's because I can go build whatever
bindings I want on my own. I can metaprogram all kinds of elaborate
stuff. I can build private syntax. I don't need to discuss that with
lua-l. I've been in the bubble long enough that I *like* just about
everything....

Jay