lua-l archive



On Wed, Feb 8, 2012 at 7:10 PM, Sam Roberts <vieuxtech@gmail.com> wrote:

> I'm slightly baffled as to why this long conversation about unicode
> support in lua doesn't seem to acknowledge that the features requested
> already exist, AFAICT, see
> icu4lua and slnunicode at end of http://lua-users.org/wiki/LuaUnicode

No, the features I'm talking about don't exist, for several reasons,
starting with the fact that I don't want a feature but a metamechanism
where features could go. Stipulate that everybody in this discussion
could produce a reasonably competent userdata-based libicu binding.
No, icu4lua is not trivial, but it is straightforward.

I had in fact missed slnunicode, which implements most of string.*
over byte-like strings in ASCII, ISO 8859-1, and UTF-8 with or without
grapheme match. It ignores a lot of the hard stuff, which is good.
Eyeballing it, size is hardly a complaint. On i386 Linux the object
file is 27k (the .so bloats out to 32k). It uses the compact Unicode
table from Tcl; although it's missing a bunch of stuff, it makes me
look a little foolish when talking about a "big table":

$ nm --radix d --print-size unicode.so | perl -ln -e 'if (/ (\d+) r /) { print; $t += $1; } END { print "TOTAL $t"; }'

00019900 00007596 r groupMap
00027496 00000508 r groups
00014012 00005886 r pageMap
TOTAL 13990

In the context of this discussion, slnunicode's functionality is
*overkill*, although it would fit nicely as a complement.
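To make the byte/character split concrete, here is roughly what an
slnunicode session looks like. The module layout (a `unicode` table with
`ascii`/`latin1`/`utf8`/`grapheme` sub-tables mirroring string.*) is
from memory, so treat the exact names as approximate:

```lua
-- Sketch of an slnunicode session; module layout quoted from memory,
-- so verify the names against your copy of the library.
local ok, unicode = pcall(require, "unicode")

local s = "\195\169l\195\168ve"   -- "élève" written as UTF-8 byte escapes

print(#s)                         -- 7: plain Lua counts bytes
if ok then
  -- unicode.utf8.* mirrors string.*, but counts code points, not bytes
  print(unicode.utf8.len(s))      -- 5, if my memory of the API is right
end
```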

It doesn't help you with the task of producing valid text, or with
debugging failures to do so, especially when they span module/author
boundaries.

> . One or both of those libraries seem to support most (all?) of what
> has been identified as "needs" in the multiple times this topic has
> been beaten to death on lua-l.

I disagree with your "assessment", but thanks for the dismissive drive-by.

> Getting lua's core to change its view of strings to being something
> other than a byte-sequence isn't going to happen, its not the lua way,

...which is why I'm not proposing that.[1]

> and its caused big problems for languages that have tried it (and not
> just code bloat), see http://lwn.net/Articles/478486/ [*].

Gosh, nobody with an interest in i18n would have any experience
working with bytes and text in other languages. (Hint: it's right up
there in the subject line: "what do you MISS most in Lua".)

I don't think anybody is proposing Python 2's or Python 3's approach. I
take that back; it may have come up, because we are, you know,
*discussing* what can be done in a way that fits the rest of Lua. I
don't think Python's approach fits; besides, it is icu4lua's approach
as well, except that icu4lua can't be as well integrated.

In my opinion, languages which treat text and sequences of bytes as
equivalent are going to look increasingly like relics of the 20th
Century. People want their programs to process text; this is why we
have powerful tools designed primarily for ASCII strings around,
although in practice often they don't touch text other than
delimiters. It was an easy jump for most of those 7-bit ASCII tools to
then support high-bit single-byte encodings...

Wait. It wasn't. How long did it take for mail to become 8-bit clean? I
believe I've seen a regexp implementation that used the 128-159
character range as internal markers. In any case, extended ASCII did
eventually get supported, although those encodings tended to be
second-class citizens. And the extended-ASCII approach doesn't work at
all for writing systems with more than 256 glyphs.

> lua's
> approach that strings are binary bytes, and you can decode them using
> a 3rd party library into a unicode/other-encoding aware
> representation/library seems the right thing to do.

Lua's current approach strongly implies that the only relevant
operations on text are single-byte ones. Had the language been
designed where CJK languages were dominant instead of Portuguese,
would lstrlib.c look the same? What would the PiL string sections look
like?
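For anyone who hasn't hit it yet, that byte-centric assumption is easy
to demonstrate in stock Lua, with no extra libraries (the C-locale
caveat on upper() is my own hedge):

```lua
-- "日" (U+65E5) is three bytes in UTF-8; stock Lua sees only the bytes.
local s = "\230\151\165"

print(#s)              -- 3: # counts bytes, not characters
print(s:sub(1, 1))     -- the first *byte*, not valid UTF-8 on its own
print(s:match("^.$"))  -- nil: '.' in patterns matches exactly one byte
print(s:upper() == s)  -- true in the C locale: case mapping is per-byte
```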

> Getting a new library into the lua core is unlikely, but could happen.
> bit32 would be the best model - when pretty much everybody was
> actually including a bit library in their project, and there was wide
> agreement that it was useful, it finally made it into lua.

I have used bitlib perhaps twice; not everybody was using it. So that
explanation of how bit32 ended up in the distribution doesn't quite
seem right.

I'm not proposing a library, I'm first groping around for a
metamechanism to assist text processing in non-byte locales using
strings, and one that doesn't break the rest of the language or
require byte-locale people to eat a bunch of complexity. I don't have
a library, or rather I have far too many, and none of them fit right.
I want a *coordination* mechanism, not a specification.

> So, if
> there was some non-binary string support library that pretty much
> everybody used, and found useful, it might make it into lua 5.9, or
> something, but in the meantime, if unicode is so critical, and lua's
> library doesn't support it - what are people doing? Ignoring it? Well,
> then it ain't critical. Using some external library? Well, then its
> also not critical, since support exists. Kind of a catch-22,
> really....

Clearly we live in the best of all possible worlds.

Jay

[1]: Actually, I've proposed pointing _G.string at text-centric
functions in a UTF-8 mode as a "what-if", but I don't think you're
paying attention enough to argue with this statement.