On Wed, 08 Feb 2012 11:49:47 -0700
Tim Mensch wrote:

> On 2/8/2012 11:01 AM, Dirk Laurie wrote:
> > (1) Additional functions in "string" library, e.g. str:usub(3,6) 
> > extracts UTF8 characters 3 to 6 and throws an error if str is not 
> > valid UTF8.  Pro: simplest.  Con: requires a change in 'official'
> > Lua, can't genuinely start mid-string.
> Is there some reason that I'm not getting that we couldn't add
> functions to "string"? Just that it's considered bad form?
> Though without modifying Lua proper, such functions couldn't use the
> optimizations I suggested in another message, so no, you can't
> start mid-string.
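
A minimal sketch of option (1) in pure Lua, to show that nothing stops a script from adding functions to "string". The name usub comes from the proposal; the helper char_starts is hypothetical. Unlike the proposal, this sketch does not validate its input or raise an error on invalid UTF-8, it does not handle negative indices, and it rescans from the start of the string on every call, which is the "can't start mid-string" cost mentioned above.

```lua
-- Hypothetical helper: byte offsets at which each UTF-8 character starts.
-- A character is one non-continuation byte followed by any number of
-- continuation bytes (0x80-0xBF, written here as \128-\191).
local function char_starts(s)
  local starts = {}
  for pos in s:gmatch("()[^\128-\191][\128-\191]*") do
    starts[#starts + 1] = pos
  end
  return starts
end

-- Extract characters i..j (1-based, inclusive), counted in UTF-8
-- characters rather than bytes. No validation is performed.
function string.usub(s, i, j)
  local starts = char_starts(s)
  local n = #starts
  j = j or n
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  local first = starts[i]
  -- The last byte is the one just before the next character starts.
  local last = (starts[j + 1] or #s + 1) - 1
  return s:sub(first, last)
end

-- "h\195\169llo" is "héllo" written with explicit UTF-8 bytes.
local middle = ("h\195\169llo"):usub(2, 4)
```

Because strings share a metatable whose __index is the "string" table, the method-call form str:usub(3,6) works without touching the Lua core.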
> > (3) Another standard library, say "utf8", but operating on
> > userdata, e.g. ustr:sub(3,6).  ustr:type() is 'utf8'.  Creates a
> > private code point address list.  Pro: avoids cons of (1) and (2).
> > Con: requires conversion to-from string.
> One of the key advantages of using UTF-8 is that you're just 
> manipulating strings, so not being able to convert trivially is 
> annoying. Obviously it could have a __tostring function, though, so 
> converting in that direction doesn't need to be painful. If only 
> concatenation (..) would run __tostring on a table parameter, we'd be 
> set with the userdata approach. And to convert the other direction, a 
> short function name like "_u" could make that easy:  _u"string to
> make a UTF-8 object out of".
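
A rough sketch of the wrapper idea, assuming a plain table stands in for the userdata (pure Lua cannot create full userdata). The names UStr and len are hypothetical; _u is the name suggested above. The ".." operator does not call __tostring, but it does consult __concat when an operand is not a string or number, so the metamethod can do the conversion itself.

```lua
local UStr = {}
UStr.__index = UStr

-- Wrap a byte string in a UTF-8 object (a table here, not userdata).
function _u(s)
  return setmetatable({ bytes = s }, UStr)
end

-- Converting back to a plain string is just a field access.
function UStr.__tostring(u)
  return u.bytes
end

-- Called for "a .. b" when either operand is a UStr. Both sides are
-- converted with tostring, and the result is a plain Lua string.
function UStr.__concat(a, b)
  return tostring(a) .. tostring(b)
end

-- Length in characters: count every byte that is not a continuation byte.
function UStr:len()
  local _, count = self.bytes:gsub("[^\128-\191]", "")
  return count
end

local greeting = "Hello, " .. _u"w\195\182rld"   -- "wörld" in UTF-8 bytes
```

Returning a plain string from __concat is a design choice for this sketch; returning another wrapped object would be just as reasonable.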
> I pretty much need such a function anyway, since the sane way to do 
> internationalization is to put your strings in a table somewhere
> which gets switched based on your locale, so in the code I'll have:
> _t "English version of the string."
> ...and then elsewhere I'll have a table:
> {
>     ["English version of the string."] =
>         "A translation of the string to another language."
> }
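
The lookup described above could be sketched roughly as follows. The names catalogs and set_locale and the locale code "xx" are made up for illustration; only _t and the table entry come from the message. Falling back to the English key when no translation exists is an assumption of this sketch.

```lua
-- One table of translations per locale, keyed by the English string.
local catalogs = {
  xx = {
    ["English version of the string."] =
      "A translation of the string to another language.",
  },
}

-- The active catalog; empty until a locale is selected.
local current = {}

-- Hypothetical locale switch: unknown locales select an empty catalog.
function set_locale(name)
  current = catalogs[name] or {}
end

-- Return the translation if there is one, otherwise the English text.
function _t(s)
  return current[s] or s
end

set_locale("xx")
local msg = _t "English version of the string."
```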
> > But your item [2] really kills all of these ideas.  If we can't
> > have ustr:match, we may as well compile Lua with 16-bit Unicode
> > strings if our locale is fundamentally non-ASCII.
> Yuck. I would suggest that 16-bit Unicode was NEVER a good idea. Not 
> even counting combining characters, you can't even fit all of the 
> Unicode code points in 16 bits (over 110,000 now [1]), so some of
> them take two words to store ("surrogate pairs"). This means that you
> can't reliably index a UTF-16 string using offsets, and direct
> indexing of characters is the only argument I've heard in favor of
> UTF-16.
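
To make the surrogate-pair point concrete, here is a small sketch of the standard UTF-16 encoding arithmetic. The function name utf16_units is hypothetical, and the input is assumed to be a valid code point.

```lua
-- Return the UTF-16 code units (as a list of numbers) for one code point.
-- Code points below 0x10000 fit in a single 16-bit unit; the rest are
-- split into a high and a low surrogate.
function utf16_units(cp)
  if cp < 0x10000 then
    return { cp }
  end
  local v = cp - 0x10000                        -- 20 bits remain
  local high = 0xD800 + math.floor(v / 0x400)   -- top 10 bits
  local low  = 0xDC00 + v % 0x400               -- bottom 10 bits
  return { high, low }
end

local euro = utf16_units(0x20AC)    -- one unit
local clef = utf16_units(0x1D11E)   -- two units: 0xD834, 0xDD1E
```

Since the unit count per character varies, the n-th character of a UTF-16 string is not necessarily at offset n, the same problem UTF-8 has.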
> Aside from that, apart from Windows, the rest of the world seems to
> be moving toward UTF-8 as a standard encoding. (I know that it's more 
> complicated in some countries, but still it seems to be the general
> trend.)
> Making the pattern matching work for UTF-8 strings wouldn't be rocket 
> science. As was pointed out in another message, MOST of the patterns 
> would work MOSTLY as-is. I bet it wouldn't take more than a few minor 
> patches to make a version of match() that would work fine for UTF-8.
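
As a sketch of how far the existing pattern functions already go, the following iterates over UTF-8 characters with an ordinary pattern. The names UTF8_CHAR and uchars are hypothetical, and the input is assumed to be valid UTF-8. It also shows one place where stock patterns fall short: "." matches a single byte, not a whole character.

```lua
-- One UTF-8 character: a non-continuation byte followed by any number
-- of continuation bytes (0x80-0xBF).
local UTF8_CHAR = "[^\128-\191][\128-\191]*"

-- Iterate over the characters of s, one (possibly multi-byte) string
-- per step.
function uchars(s)
  return s:gmatch(UTF8_CHAR)
end

local word = "na\195\175ve"   -- "naïve" written with explicit UTF-8 bytes

local chars = {}
for ch in uchars(word) do
  chars[#chars + 1] = ch
end

-- Literal text in a pattern works unchanged, since it is compared
-- byte for byte; the positions returned are byte offsets.
local start_pos = word:find("\195\175ve")

-- "." captures only the lead byte of the two-byte character.
local one_byte = word:match("a(.)")
```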
> Tim
> [1]

(Regarding pattern matching)

I don't know; Unicode properties make things...interesting.
