- Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
- From: Rob Hoelz <rob@...>
- Date: Wed, 8 Feb 2012 12:18:21 -0600
On Wed, 8 Feb 2012 20:01:12 +0200
Dirk Laurie <dirk.laurie@gmail.com> wrote:
> On 8 February 2012 17:18, Jay Carlson <nop@nop.com> wrote:
>
> >
> > [1]: Why yes, if UTF-8 processing is how we do Unicode processing,
> > and we don't have the character property tables, we've reduced this
> > to a trivial case of the whole "strings have types; will your
> > language help you?" question. It's just a very simple language.
> >
> > [2]: Patterns look very difficult to fix up on the Lua side though.
> >
> I think we are all agreed that some sort of UTF8 support in Lua is
> desirable if not essential. The question is: how?
>
> (1) Additional functions in "string" library, e.g. str:usub(3,6)
> extracts UTF8 characters 3 to 6 and throws an error if str is not
> valid UTF8. Pro: simplest. Con: requires a change in 'official'
> Lua, can't genuinely start mid-string.
> (2) Another standard library, say "ustring", with functions like
> "string" but UTF8 semantics, say ustring.sub(str,3,6). Pro: can be
> implemented as a third-party library with no change to 'official'
> Lua. Con: like (1), also no object oriented calls.
> (3) Another standard library, say "utf8", but operating on userdata,
> e.g. ustr:sub(3,6). ustr:type() is 'utf8'. Creates a private code
> point address list. Pro: avoids cons of (1) and (2). Con: requires
> conversion to-from string.
>
> But your item [2] really kills all of these ideas. If we can't have
> ustr:match, we may as well compile Lua with 16-bit Unicode strings if
> our locale is fundamentally non-ASCII.
I can actually think of a fourth solution, although it will probably
be received as rather hack-ish. You can add two new functions,
string.upgrade and string.downgrade, which mark a string as UTF-8 or
"non-UTF-8". You then replace the standard string methods with
versions that run either a UTF-8 aware implementation or the standard
implementation, depending on whether or not that string has been
"upgraded". This allows for legacy code to work (since you'd have to
actually upgrade() a string to use UTF-8 operations), allows the work
to be done in an external library, and allows the use of
str:methodname(), at the cost of a small performance hit and a little
uncleanliness.
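
A minimal sketch of what this could look like in plain Lua, under a few
assumptions. string.upgrade and string.downgrade are the hypothetical
names from above and are not part of stock Lua. Since a Lua string
cannot carry a flag of its own, the sketch records the mark in a side
table keyed by string value, so every string with the same contents is
marked at once. Only len and sub are wrapped, sub handles positive
indices only, and no UTF-8 validation is done.

```lua
-- Hypothetical sketch: string.upgrade / string.downgrade do not exist in stock Lua.
local raw_len, raw_sub = string.len, string.sub

-- Set of string values currently marked as UTF-8 (assumption: marking is by value).
local upgraded = {}

function string.upgrade(s)
  upgraded[s] = true
  return s
end

function string.downgrade(s)
  upgraded[s] = nil
  return s
end

-- Count characters by counting every byte that is not a
-- UTF-8 continuation byte (0x80-0xBF).
local function utf8_len(s)
  local _, n = s:gsub("[^\128-\191]", "")
  return n
end

-- Extract characters i..j (positive indices only in this sketch).
local function utf8_sub(s, i, j)
  local starts = {}                       -- byte offset of each character
  for pos in s:gmatch("()[^\128-\191]") do
    starts[#starts + 1] = pos
  end
  local n = #starts
  i = math.max(i or 1, 1)
  j = math.min(j or n, n)
  if i > j then return "" end
  local last = (starts[j + 1] or #s + 1) - 1
  return raw_sub(s, starts[i], last)
end

-- Replacement methods: dispatch on whether the string has been upgraded.
function string.len(s)
  if upgraded[s] then return utf8_len(s) end
  return raw_len(s)
end

function string.sub(s, i, j)
  if upgraded[s] then return utf8_sub(s, i, j) end
  return raw_sub(s, i, j)
end

-- Usage: "h\195\169llo" is "héllo" in UTF-8 (6 bytes, 5 characters).
local s = "h\195\169llo"
print(s:len())            -- byte length, string not upgraded yet
s:upgrade()
print(s:len())            -- character length
print(s:sub(2, 3))        -- second and third characters
s:downgrade()
```

Note that the # operator still reports bytes, and that the result of
sub is a new string which is not itself upgraded; a real library would
have to decide how the mark propagates.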
-Rob