Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

lua-l archive

Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Jay Carlson <nop@...>
Date: Wed, 8 Feb 2012 16:01:20 -0500

On Wed, Feb 8, 2012 at 1:01 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> On 8 February 2012 17:18, Jay Carlson <nop@nop.com> wrote:
>
>>
>> [1]: Why yes, if UTF-8 processing is how we do Unicode processing, and
>> we don't have the character property tables, we've reduced this to a
>> trivial case of the whole "strings have types; will your language help
>> you?" question. It's just a very simple language.
>>
>> [2]: Patterns look very difficult to fix up on the Lua side though.

Let me clarify: if we're really trying to write as little C code as
possible, we could write:

function string.sub(s, i, j)
  return assert_utf8(bytestring.sub(assert_utf8(s), i, j))
end

function string.rep(s, n)
  -- No need to check the return in this case
  return bytestring.rep(assert_utf8(s), n)
end

...and then the examples get more complicated. With only a little
cooperation from the C code, we can implement much of string.* in
terms of bytestring.* in Lua itself.
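To make the wrappers above concrete, here is a minimal sketch of what
assert_utf8 could look like in plain Lua. Nothing here is an existing
API: assert_utf8, the bytestring table (here just an alias for the
stock byte-oriented functions) and text_sub are names assumed for
illustration. Byte values are written with string.byte and decimal
escapes so the sketch should run on Lua 5.1 and later.

```lua
-- Assumption: "bytestring" is simply the stock byte-oriented library.
local bytestring = { sub = string.sub, rep = string.rep }

-- Hypothetical validator: returns s unchanged if it is well-formed
-- UTF-8, otherwise raises an error.
local function assert_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, cp, min
    if b < 0x80 then
      len, cp, min = 1, b, 0
    elseif b >= 0xC0 and b < 0xE0 then
      len, cp, min = 2, b - 0xC0, 0x80
    elseif b >= 0xE0 and b < 0xF0 then
      len, cp, min = 3, b - 0xE0, 0x800
    elseif b >= 0xF0 and b < 0xF8 then
      len, cp, min = 4, b - 0xF0, 0x10000
    else
      error("stray continuation or invalid lead byte at " .. i, 2)
    end
    -- Every following byte of the sequence must be 0x80..0xBF.
    for k = i + 1, i + len - 1 do
      local c = s:byte(k)
      if not c or c < 0x80 or c > 0xBF then
        error("truncated sequence at byte " .. i, 2)
      end
      cp = cp * 64 + (c - 0x80)
    end
    -- Reject overlong forms, surrogates and values past U+10FFFF.
    if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
      error("overlong or out-of-range sequence at byte " .. i, 2)
    end
    i = i + len
  end
  return s
end

-- Same shape as the string.sub wrapper above, under a hypothetical name.
local function text_sub(s, i, j)
  return assert_utf8(bytestring.sub(assert_utf8(s), i, j))
end
```

With this, text_sub("\208\175\208\169", 1, 2) returns the two bytes of
"Я", while text_sub("\208\175\208\169", 1, 3) raises an error because
the result would end in the middle of "Щ".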

My caveat was that I saw no easy way in Lua to implement pattern
matching without the C code making "." work in terms of UTF-8 code
points. There's clearly a subset of bytestring.match patterns
guaranteed to only return strings which will pass assert_utf8; one
example is "a(.*)b". There's also a set guaranteed to error out on
some inputs; one example is "Я(.)" when matching "ЯЩ", because the
capture gets a single lead byte (0xD0), which then flunks
assert_utf8().
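Both cases can be checked against the stock byte-oriented string.match.
This is only a sketch; YA and SHCHA are names made up here, and the
Cyrillic letters are spelled as decimal byte escapes so the result does
not depend on the source file's encoding. The first pattern is safe
because "a" and "b" are ASCII bytes, which never occur inside a
multi-byte UTF-8 sequence, so the capture always starts and ends on a
character boundary.

```lua
local YA    = "\208\175"  -- "Я", U+042F, bytes D0 AF
local SHCHA = "\208\169"  -- "Щ", U+0429, bytes D0 A9

-- Safe: the capture is delimited by ASCII literals.
local inner = ("a" .. YA .. SHCHA .. "b"):match("a(.*)b")

-- Unsafe: "." consumes exactly one byte, so the capture is only the
-- first byte of "Щ".
local cap = (YA .. SHCHA):match(YA .. "(.)")

print(#inner, #cap)  -- byte lengths of the two captures
```

Here inner is the four bytes of "ЯЩ", and cap is the single byte 0xD0,
which is not valid UTF-8 on its own.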

> I think we are all agreed that some sort of UTF8 support in Lua is desirable
> if not essential.

I would bet a lot of Latin1 people see no compelling need. Upthread
there's a pretty strong opinion that Lua string handling will never
change in any way.

> The question is: how?

Yeah. The art is going to be coming up with something simple enough
that Latin1 people don't eat a lot of complexity.

> (1) Additional functions in "string" library, e.g. str:usub(3,6) extracts
> UTF8 characters 3 to 6 and throws an error if str is not valid UTF8.  Pro:
> simplest.  Con: requires a change in 'official' Lua, can't genuinely start
> mid-string.
> (2) Another standard library, say "ustring", with functions like "string"
> but UTF8 semantics, say ustring.sub(str,3,6).  Pro: can be implemented as a
> third-party library with no change to 'official' Lua.  Con: like (1), also
> no object oriented calls.
> (3) Another standard library, say "utf8", but operating on userdata, e.g.
> ustr:sub(3,6).  ustr:type() is 'utf8'.  Creates a private code point address
> list.  Pro: avoids cons of (1) and (2).  Con: requires conversion to-from
> string.

What are the values 3 and 6?

The cheap version is that those do not actually count code points;
they are an opaque (I've been calling it "UINDEX") numeric type, which
just happens to be byte indexes. But you can't do arithmetic on them;
adding 1 to your 3 does not necessarily give a valid UINDEX on str.
The intellectual justification for this that Miles and I are advancing
is that we're already doing almost everything in loops over pattern
matches, and in that model the index values are already opaque, if not
discarded. But because people are going to do math on them
(accidentally or out of ignorance), the substring extractor has to
blow up on mid-sequence indexes to preserve everything nice we get in
closure over UTF-8.
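A rough sketch of such a substring extractor, assuming the input is
already known to be valid UTF-8 and that both indexes are positive byte
offsets within the string. The names usub and is_continuation are
invented for this example. The check relies on the fact that UTF-8
continuation bytes are exactly 0x80..0xBF, so an index is mid-sequence
if it would cut in front of one.

```lua
-- True if b is a UTF-8 continuation byte (0x80..0xBF).
local function is_continuation(b)
  return b ~= nil and b >= 0x80 and b <= 0xBF
end

-- Hypothetical extractor: i and j are byte offsets treated as opaque
-- UINDEX values. Errors out rather than split a multi-byte sequence.
local function usub(s, i, j)
  if is_continuation(s:byte(i)) then
    error("start index " .. i .. " is mid-sequence", 2)
  end
  -- The byte after j must begin a new character (or be past the end).
  if is_continuation(s:byte(j + 1)) then
    error("end index " .. j .. " is mid-sequence", 2)
  end
  return s:sub(i, j)
end

local s = "a\208\175b"   -- "aЯb": "Я" occupies bytes 2 and 3
print(usub(s, 2, 3) == "\208\175")
print(pcall(usub, s, 3, 4))  -- fails: index 3 is inside "Я"
```

Adding 1 to the UINDEX 2 gives 3, which lands inside "Я", and that is
exactly the case where the extractor refuses to return anything.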

My personal taste is something like perl's "-C" flag to turn on the
whole machinery including stdio enforcement, and in any case for
_G.string to either be deprecated or default to text-like rather than
byte-like operations. The single-byte character constituency could
just union the text and byte operations together since they mostly
coincide, but new code specifying text vs byte intent even in
single-byte locales may be usable in UTF-8. I'm not counting on it,
mind you. The only guarantee my proposal has is that invalid UTF-8
will not show up--it could still be garbage because of mismanipulation
of byte indexing.

Adding a type to strings (and UINDEX values!) *could* keep people from
doing that. I'm not sold on it, as it might better be done
aftermarket. I have a faint hope that such an addon would not actually
change the interfaces.

> But your item [2] really kills all of these ideas.  If we can't have
> ustr:match, we may as well compile Lua with 16-bit Unicode strings if our
> locale is fundamentally non-ASCII.

There may be acceptable restrictions which could be made on patterns
such that they never return invalid UTF-8 when fed valid UTF-8, or
behave distastefully. One technique bans repetition on non-ASCII
literals--because "Я+" as interpreted by a byte pattern engine looks
like "\xD0\xAF+", and you can't have "\xD0\xAF\xAF" anywhere in
valid UTF-8. That's probably not what the programmer intended.
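Here is what the stock byte engine does with that pattern, followed by
a deliberately crude sketch of the ban. check_pattern is a hypothetical
helper, not anything that exists in Lua; it does not understand "%"
escapes or "[...]" sets and simply rejects any quantifier that directly
follows a byte >= 0x80, which errs on the conservative side.

```lua
local YA = "\208\175"          -- "Я", bytes D0 AF
local pat = YA .. "+"          -- the byte engine reads this as D0, then AF+

-- On "ЯЯ" the repetition stops after the first character...
local first = (YA .. YA):match(pat)
-- ...while the invalid byte string D0 AF AF matches in full.
local bogus = ("\208\175\175"):match(pat)

-- Hypothetical linter: refuse repetition applied to a non-ASCII byte.
local function check_pattern(p)
  for i = 2, #p do
    local q = p:sub(i, i)
    if (q == "+" or q == "*" or q == "-" or q == "?")
       and p:byte(i - 1) >= 0x80 then
      error("repetition on a non-ASCII literal at byte " .. i, 2)
    end
  end
  return p
end

print(#first, #bogus)
print(pcall(check_pattern, pat))
```

So "Я+" never matches more than one "Я", and the only longer thing it
can match is not valid UTF-8 in the first place.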

Jay
