lua-users home
lua-l archive


On Thu, Feb 9, 2012 at 07:29, Jay Carlson <> wrote:
> On Wed, Feb 8, 2012 at 7:10 PM, Sam Roberts <> wrote:
>> I'm slightly baffled as to why this long conversation about unicode
>> support in lua doesn't seem to acknowledge that the features requested
>> already exist, AFAICT, see
>> icu4lua and slnunicode at end of
> No, the features I'm talking about don't exist for several reasons,
> starting from first not wanting a feature but a metamechanism where
> features could go. Stipulate everybody in this discussion could
> produce a reasonably competent userdata-based libicu binding. No,
> icu4lua is not trivial, but it is straightforward.
> I had in fact missed slnunicode, which implements most of string.*
> over byte-like strings in ASCII, ISO 8859-1, and UTF-8 with or without
> grapheme match. It ignores a lot of the hard stuff, which is good.
> Eyeballing it, size is hardly a complaint. On i386 Linux the object
> file is 27k (the .so bloats out to 32k). It uses the compact Unicode
> table from Tcl; although it's missing a bunch of stuff, it makes me
> look a little foolish when talking about a "big table":
> $ nm --radix d --print-size  | perl -ln -e 'if (/ (\d+) r /) { print; $t += $1; } END { print "TOTAL $t"; }'
> 00019900 00007596 r groupMap
> 00027496 00000508 r groups
> 00014012 00005886 r pageMap
> TOTAL 13990
> In the context of this discussion, slnunicode's functionality is
> *overkill*, although it would fit nicely as an augment.
> It doesn't help you in the task of producing valid text or debugging
> said failure, especially when spanning module/author boundaries.
>> . One or both of those libraries seem to support most (all?) of what
>> has been identified as "needs" in the multiple times this topic has
>> been beaten to death on lua-l.
> I disagree with your "assessment", but thanks for the dismissive drive-by.
>> Getting lua's core to change its view of strings to being something
>> other than a byte-sequence isn't going to happen, it's not the Lua way,
> ...which is why I'm not proposing that.[1]
>> and it's caused big problems for languages that have tried it (and not
>> just code bloat), see [*].
> Gosh, nobody with an interest in i18n would have any experience
> working with bytes and text in other languages. (Hint: it's right up
> there in the subject line: "what do you MISS most in Lua".)
> I don't think anybody is proposing Python2 or Python3's approach. I
> take that back; it may have come up because we are, you know,
> *discussing* what can be done in a way that fits the rest of Lua. I
> think Python's approach doesn't; besides, this is icu4lua's approach
> as well except it can't be as well-integrated.
> In my opinion, languages which treat text and sequences of bytes as
> equivalent are going to look increasingly like relics of the 20th
> Century. People want their programs to process text; this is why we
> have powerful tools around that were designed primarily for ASCII
> strings, although in practice they often touch no text other than
> delimiters. It was an easy jump for most of those 7-bit ASCII tools to
> then support high-bit single-byte encodings,
> Wait. It wasn't. How long did it take for mail to become 8-bit clean? I
> believe I saw a regexp implementation using the 128-159 character range
> as internal markers. In any case, extended-ASCII encodings tended to be
> second-class citizens, and the extended-ASCII approach doesn't work at
> all for writing systems with more than 256 glyphs.
>> lua's
>> approach that strings are binary bytes, and you can decode them using
>> a 3rd party library into a unicode/other-encoding aware
>> representation/library seems the right thing to do.
> Lua's current approach strongly implies that the only relevant
> operations on text are single-byte ones. Had the language been
> designed where CJK languages were dominant instead of Portuguese,
> would lstrlib.c look the same? What would the PiL string sections look
> like?
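To make the byte-orientation concrete: each of the following is documented stock-Lua behavior, and nothing in the language marks where byte operations stop being safe text operations. A minimal sketch in plain Lua 5.1+, no libraries; the string is written in byte escapes so the example survives any source encoding:

```lua
-- "ação" is 4 characters but 6 bytes in UTF-8 (ç = 0xC3 0xA7, ã = 0xC3 0xA3).
local s = "a\195\167\195\163o"

print(#s)                  -- 6: the length operator counts bytes, not characters
print(string.sub(s, 1, 2)) -- "a\195": slices a character in half, invalid UTF-8
print(string.upper(s))     -- "AçãO" under the default C locale: only ASCII maps
```

The string.upper result is locale-dependent; under the default C locale only the ASCII bytes are case-mapped.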
>> Getting a new library into the lua core is unlikely, but could happen.
>> bit32 would be the best model - when pretty much everybody was
>> actually including a bit library in their project, and there was wide
>> agreement that it was useful, it finally made it into lua.
> I have used bitlib I think twice--not everybody was using it. So that
> explanation of how bit32 ended up in the distribution does not quite
> seem to be right.
> I'm not proposing a library, I'm first groping around for a
> metamechanism to assist text processing in non-byte locales using
> strings, and one that doesn't break the rest of the language or
> require byte-locale people to eat a bunch of complexity. I don't have
> a library, or rather I have far too many, and none of them fit right.
> I want a *coordination* mechanism, not a specification.
>> So, if
>> there was some non-binary string support library that pretty much
>> everybody used, and found useful, it might make it into lua 5.9, or
>> something, but in the meantime, if unicode is so critical, and lua's
>> library doesn't support it - what are people doing? Ignoring it? Well,
>> then it ain't critical. Using some external library? Well, then it's
>> also not critical, since support exists. Kind of a catch-22,
>> really....
> Clearly we live in the best of all possible worlds.
> Jay
> [1]: Actually, I've proposed pointing _G.string at text-centric
> functions in a UTF-8 mode as a "what-if", but I don't think you're
> paying attention enough to argue with this statement.

I'm not sure where the idea of modifying Lua came from. I should note
that the original suggestion was a library that deals with UTF-8, i.e.
a module separate from Lua itself. The idea being "someone should make
such a module so that there's a good, standard implementation we can
all use and contribute to if we need it, instead of everyone writing
their own."

I suppose I overestimated just how often people actually need such a
library in the first place.
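For the record, the core of such a module can be sketched in pure Lua. The name `utf8x` and its API are hypothetical, loosely mirroring string.len plus an iterator; this is decoding only, with none of the validation a real module would need:

```lua
-- Hypothetical sketch of a pure-Lua UTF-8 module core (Lua 5.1+).
-- Decoding only: it does NOT reject overlong forms, surrogates, or
-- malformed continuation bytes, which real code must.
local utf8x = {}

-- Length in bytes of the sequence that starts with lead byte b,
-- or nil if b cannot start a sequence.
local function seqlen(b)
  if b < 0x80 then return 1
  elseif b >= 0xC0 and b < 0xE0 then return 2
  elseif b >= 0xE0 and b < 0xF0 then return 3
  elseif b >= 0xF0 and b < 0xF8 then return 4
  end
  return nil
end

-- Count code points; on bad input return nil plus the offending byte index.
function utf8x.len(s)
  local n, i = 0, 1
  while i <= #s do
    local l = seqlen(string.byte(s, i))
    if not l then return nil, i end
    n, i = n + 1, i + l
  end
  return n
end

-- Iterator yielding (byte_position, code_point) for each character.
function utf8x.codes(s)
  local i = 1
  return function()
    if i > #s then return nil end
    local b = string.byte(s, i)
    local l = seqlen(b)
    if not l then error(("invalid UTF-8 at byte %d"):format(i)) end
    local cp
    if l == 1 then cp = b           -- 0xxxxxxx
    elseif l == 2 then cp = b % 32  -- 110xxxxx
    elseif l == 3 then cp = b % 16  -- 1110xxxx
    else cp = b % 8 end             -- 11110xxx
    for j = i + 1, i + l - 1 do
      cp = cp * 64 + string.byte(s, j) % 64  -- 10xxxxxx continuations
    end
    local pos = i
    i = i + l
    return pos, cp
  end
end

print(utf8x.len("a\195\167\195\163o"))  -- 4: "ação" is four code points
```

A real module would also want sub, find, and friends over characters rather than bytes, plus normalization; that is roughly the territory slnunicode already covers.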

Sent from my toaster.