Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

lua-l archive


Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Rob Hoelz <rob@...>
Date: Wed, 8 Feb 2012 12:18:21 -0600

On Wed, 8 Feb 2012 20:01:12 +0200
Dirk Laurie <dirk.laurie@gmail.com> wrote:

> On 8 February 2012 17:18, Jay Carlson <nop@nop.com> wrote:
> 
> >
> > [1]: Why yes, if UTF-8 processing is how we do Unicode processing,
> > and we don't have the character property tables, we've reduced this
> > to a trivial case of the whole "strings have types; will your
> > language help you?" question. It's just a very simple language.
> >
> > [2]: Patterns look very difficult to fix up on the Lua side though.
> >
> I think we are all agreed that some sort of UTF8 support in Lua is
> desirable if not essential.  The question is: how?
> 
> (1) Additional functions in "string" library, e.g. str:usub(3,6)
> extracts UTF8 characters 3 to 6 and throws an error if str is not
> valid UTF8.  Pro: simplest.  Con: requires a change in 'official'
> Lua, can't genuinely start mid-string.
> (2) Another standard library, say "ustring", with functions like
> "string" but UTF8 semantics, say ustring.sub(str,3,6).  Pro: can be
> implemented as a third-party library with no change to 'official'
> Lua.  Con: like (1), also no object oriented calls.
> (3) Another standard library, say "utf8", but operating on userdata,
> e.g. ustr:sub(3,6).  ustr:type() is 'utf8'.  Creates a private code
> point address list.  Pro: avoids cons of (1) and (2).  Con: requires
> conversion to-from string.
> 
> But your item [2] really kills all of these ideas.  If we can't have
> ustr:match, we may as well compile Lua with 16-bit Unicode strings if
> our locale is fundamentally non-ASCII.
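
To make option (2) above concrete, here is a minimal sketch of what such a
third-party library might look like in plain Lua.  The table name "ustring"
comes from the quoted message; everything else (the helper char_offset,
positive-only indices, no validation of the input) is an assumption made
for illustration, not an existing library.

```lua
-- Hypothetical "ustring" module: string-like functions with UTF-8 semantics.
-- Assumes well-formed UTF-8 input and positive indices only.
local ustring = {}

-- Count characters by counting every byte that is NOT a continuation
-- byte (continuation bytes are 0x80..0xBF, i.e. decimal 128..191).
function ustring.len(s)
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

-- Byte position where the n-th character starts; #s + 1 when n is one
-- past the last character; nil when n is out of range.
local function char_offset(s, n)
  local count = 0
  for pos = 1, #s do
    local b = string.byte(s, pos)
    if b < 0x80 or b >= 0xC0 then      -- lead byte or ASCII
      count = count + 1
      if count == n then return pos end
    end
  end
  if n == count + 1 then return #s + 1 end
  return nil
end

-- Extract characters i..j (inclusive), counted in characters, not bytes.
function ustring.sub(s, i, j)
  j = j or ustring.len(s)
  local first = char_offset(s, i)
  if not first then return "" end
  local after = char_offset(s, j + 1) or (#s + 1)
  return string.sub(s, first, after - 1)
end
```

With s = "h\195\169llo" (the UTF-8 encoding of "héllo"), #s is 6 while
ustring.len(s) is 5, and ustring.sub(s, 2, 3) returns "\195\169l".  As the
quoted message notes, this form cannot be called as s:sub(...), and it does
nothing for patterns.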

I can actually think of a fourth solution, although it will probably be
received as rather hack-ish.  You can add two new functions,
string.upgrade and string.downgrade, which mark a string as UTF-8 or
"non-UTF-8".  You then replace the standard string methods with
versions that run either a UTF-8 aware implementation or the standard
implementation, depending on whether or not that string has been
"upgraded".  This allows for legacy code to work (since you'd have to
actually upgrade() a string to use UTF-8 operations), allows the work
to be done in an external library, and allows the use of
str:methodname(), at the cost of a small performance hit and a little
uncleanliness.
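
A rough sketch of how the upgrade/downgrade dispatch could be done as an
external library in plain Lua follows.  Only the names string.upgrade and
string.downgrade come from the proposal; the "upgraded" table and the
helper functions are assumptions for illustration.  One caveat: in stock
Lua, equal strings are the same value, so a pure-Lua version can only mark
a string by its content, not an individual instance.  Only len and sub are
replaced here, with positive indices and no validation.

```lua
-- Keep the original byte-oriented implementations around.
local raw_len, raw_sub = string.len, string.sub

-- Set of string contents that have been marked as UTF-8 (hypothetical).
local upgraded = {}

function string.upgrade(s)   upgraded[s] = true; return s end
function string.downgrade(s) upgraded[s] = nil;  return s end

-- Number of characters: bytes that are not continuation bytes (128..191).
local function utf8_len(s)
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

-- Byte position of the n-th character, #s + 1 for one past the end,
-- or nil when n is out of range.
local function char_offset(s, n)
  local count = 0
  for pos = 1, #s do
    local b = string.byte(s, pos)
    if b < 0x80 or b >= 0xC0 then
      count = count + 1
      if count == n then return pos end
    end
  end
  if n == count + 1 then return #s + 1 end
  return nil
end

local function utf8_sub(s, i, j)
  j = j or utf8_len(s)
  local first = char_offset(s, i)
  if not first then return "" end
  local after = char_offset(s, j + 1) or (#s + 1)
  return raw_sub(s, first, after - 1)
end

-- Replacement methods: dispatch on whether the string was upgraded.
-- Because the string metatable's __index is the "string" table,
-- s:len() and s:sub() pick these up automatically.
function string.len(s)
  if upgraded[s] then return utf8_len(s) end
  return raw_len(s)
end

function string.sub(s, i, j)
  if upgraded[s] then return utf8_sub(s, i, j) end
  return raw_sub(s, i, j)
end
```

With s = "h\195\169llo", s:len() is 6 before string.upgrade(s) and 5
afterwards, and s:sub(2, 3) changes from "\195\169" to "\195\169l".  The
# operator is not affected, and results of s:sub() are not upgraded
automatically in this sketch; whether they should be is left open.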

-Rob

