Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

lua-l archive

Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Tim Mensch <tim-lua-l@...>
Date: Wed, 08 Feb 2012 11:49:47 -0700

On 2/8/2012 11:01 AM, Dirk Laurie wrote:

(1) Additional functions in "string" library, e.g. str:usub(3,6) extracts UTF8 characters 3 to 6 and throws an error if str is not valid UTF8. Pro: simplest. Con: requires a change in 'official' Lua, can't genuinely start mid-string.

Is there some reason that I'm not getting that we couldn't add functions to "string"? Just that it's considered bad form?

Though it wouldn't be able to add the optimizations I suggested in another message if you didn't modify Lua proper, so no, you can't start mid-string.
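To make that concrete, here is a rough pure-Lua sketch of adding a usub to "string" without touching Lua proper. The name usub comes from the quoted proposal; seqlen and everything else are my own illustration, and the validation is deliberately loose (it checks lead and continuation bytes but does not reject overlong forms or encoded surrogates). It has to scan from byte 1 on every call, which is exactly the "can't start mid-string" cost.

```lua
-- Hypothetical helper: byte length of the UTF-8 sequence that starts
-- with lead byte b, or nil if b cannot start a sequence.
local function seqlen(b)
  if b < 0x80 then return 1
  elseif b >= 0xC2 and b <= 0xDF then return 2
  elseif b >= 0xE0 and b <= 0xEF then return 3
  elseif b >= 0xF0 and b <= 0xF4 then return 4
  end
  return nil
end

-- Illustrative only: usub is not part of standard Lua.
-- Returns characters i..j (1-based, inclusive), raising an error on
-- malformed input. Negative indices are not handled in this sketch.
function string.usub(s, i, j)
  local starts = {}            -- starts[k] = byte offset of character k
  local pos, n = 1, #s
  while pos <= n do
    local len = seqlen(s:byte(pos))
    if not len or pos + len - 1 > n then
      error("invalid UTF-8 at byte " .. pos, 2)
    end
    for k = pos + 1, pos + len - 1 do
      local c = s:byte(k)
      if c < 0x80 or c > 0xBF then
        error("invalid UTF-8 at byte " .. k, 2)
      end
    end
    starts[#starts + 1] = pos
    pos = pos + len
  end
  starts[#starts + 1] = n + 1  -- sentinel: one past the last byte
  local count = #starts - 1
  if i < 1 then i = 1 end
  if j > count then j = count end
  if i > j then return "" end
  return s:sub(starts[i], starts[j + 1] - 1)
end
```

Because strings share a metatable whose __index is the "string" table, the method-call form str:usub(3,6) works as soon as the function is defined.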

(3) Another standard library, say "utf8", but operating on userdata, e.g. ustr:sub(3,6). ustr:type() is 'utf8'. Creates a private code point address list. Pro: avoids cons of (1) and (2). Con: requires conversion to-from string.

One of the key advantages of using UTF-8 is that you're just manipulating strings, so not being able to convert trivially is annoying. Obviously it could have a __tostring function, though, so converting in that direction doesn't need to be painful. If only concatenation (..) would run __tostring on a table parameter, we'd be set with the userdata approach. And to convert the other direction, a short function name like "_u" could make that easy: _u"string to make a UTF-8 object out of".
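A small sketch of what that wrapper might look like. It uses a plain table as a stand-in for userdata (pure Lua can't create userdata), and the field name "bytes" and the UStr table are made up for illustration. Concatenation does not call __tostring, but Lua does have a separate __concat metamethod that fires when either operand of .. is not a string or number, so one possible workaround is for the object to define that itself:

```lua
-- Hypothetical wrapper type; a table stands in for the userdata.
local UStr = {}
UStr.__index = UStr

-- _u is the short constructor name suggested above.
function _u(s)
  return setmetatable({ bytes = s }, UStr)
end

-- Converting back to a plain string is just tostring(obj).
UStr.__tostring = function(u)
  return u.bytes
end

-- Called for a .. b when a or b is a UStr; the other operand may be
-- a plain string, a number, or another UStr.
UStr.__concat = function(a, b)
  return _u(tostring(a) .. tostring(b))
end
```

Here the result of .. is another wrapped object; returning a plain string instead would be an equally defensible choice, depending on which side of the conversion you want to pay for.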

I pretty much need such a function anyway, since the sane way to do internationalization is to put your strings in a table somewhere which gets switched based on your locale, so in the code I'll have:


_t "English version of the string."

...and then elsewhere I'll have a table:

{
    ["English version of the string."] = "A translation of the string to another language.",
}
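For what it's worth, the lookup side of this can be tiny. The following is only a sketch: the catalogs table, the set_locale function and the French text are invented for the example, and _t falls back to the English key when no translation exists.

```lua
-- Hypothetical per-locale tables, keyed by the English source string.
local catalogs = {
  fr = {
    ["English version of the string."] = "Version française de la chaîne.",
  },
}

local active = {}   -- table for the current locale; empty means English

-- Illustrative helper: switch the active table by locale name.
function set_locale(name)
  active = catalogs[name] or {}
end

-- Return the translation if there is one, otherwise the original string.
function _t(s)
  return active[s] or s
end
```

Since Lua lets you drop the parentheses when the only argument is a string literal, _t "English version of the string." reads almost like an annotation on the string.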

But your item [2] really kills all of these ideas. If we can't have ustr:match, we may as well compile Lua with 16-bit Unicode strings if our locale is fundamentally non-ASCII.

Yuck. I would suggest that 16-bit Unicode was NEVER a good idea. Not even counting combining characters, you can't even fit all of the Unicode code points in 16 bits (over 110,000 now [1]), so some of them take two 16-bit words to store ("surrogate pairs"). This means that you can't reliably index a UTF-16 string using offsets, and direct indexing of characters is the only argument I've heard in favor of UTF-16.
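The surrogate-pair arithmetic is short enough to show. This is the standard UTF-16 encoding rule; the function name utf16_units is just something I picked for the example.

```lua
-- Return the 16-bit code units UTF-16 needs for code point cp.
-- Code points below 0x10000 fit in one unit; anything above needs a
-- high/low surrogate pair. (Input validation is omitted in this sketch.)
local function utf16_units(cp)
  if cp < 0x10000 then
    return { cp }
  end
  local v = cp - 0x10000                      -- 20 bits remain
  local hi = 0xD800 + math.floor(v / 0x400)   -- top 10 bits
  local lo = 0xDC00 + v % 0x400               -- bottom 10 bits
  return { hi, lo }
end

-- U+20AC (euro sign) takes one unit; U+1D11E (musical G clef) takes two.
local euro = utf16_units(0x20AC)
local clef = utf16_units(0x1D11E)
print(#euro, #clef)
```

So in a string containing the clef, the n-th 16-bit unit after it is no longer the n-th character, which is the same indexing problem UTF-8 has, just rarer and therefore easier to miss.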

Aside from that, apart from Windows, the rest of the world seems to be moving toward UTF-8 as a standard encoding. (I know that it's more complicated in some countries, but still it seems to be the general trend.)

Making the pattern matching work for UTF-8 strings wouldn't be rocket science. As was pointed out in another message, MOST of the patterns would work MOSTLY as-is. I bet it wouldn't take more than a few minor patches to make a version of match() that would work fine for UTF-8.
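As a rough illustration of how far the existing pattern functions already go, the sketch below uses an ordinary byte-level pattern that matches one UTF-8 character: any byte that is not a continuation byte, followed by any continuation bytes. It assumes the input is already valid UTF-8; the names UTF8_CHAR, ulen and uchars are mine, not anything in the standard library.

```lua
-- One UTF-8 character: a non-continuation byte followed by zero or
-- more continuation bytes (0x80..0xBF, written in decimal escapes).
local UTF8_CHAR = "[^\128-\191][\128-\191]*"

-- Count characters rather than bytes (valid UTF-8 assumed).
local function ulen(s)
  local n = 0
  for _ in s:gmatch(UTF8_CHAR) do
    n = n + 1
  end
  return n
end

-- Split a string into a table of characters.
local function uchars(s)
  local t = {}
  for ch in s:gmatch(UTF8_CHAR) do
    t[#t + 1] = ch
  end
  return t
end

local s = "na\195\175ve"   -- "naïve" with the ï written as two bytes
print(#s, ulen(s))         -- byte count versus character count
```

Literal substrings and anchors behave the same on UTF-8 as on ASCII because no character's encoding can appear inside another's. The parts that would need patching are the ones that treat one byte as one character, such as ".", character classes and sets containing non-ASCII characters, and the single-character quantifiers.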


Tim

[1] http://en.wikipedia.org/wiki/Unicode
