On Fri, Apr 1, 2011 at 11:28 AM, Roberto Ierusalimschy
<roberto@inf.puc-rio.br> wrote:
>> Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>>
>> > A quick survey, for those who care:
>> > - should LPeg support utf-8?
>> > - If so, what would that mean?
>>
>> An alternative to lpeg.P(N) which matches N UTF-8 encoded code points
>> instead of octets. Similarly, alternatives to lpeg.R and lpeg.S that deal
>> with code points instead of octets. Maybe lpeg.uP and .uS and .uR ?
>> Perhaps there should be a .uB as well. I would prefer this to a "unicode
>> mode" which changes the behaviour of the existing functions.
>
> This is more or less what I had in mind (specific names
> notwithstanding). But still remains the question of whether each of these
> constructions (uS, uR, etc.) is really useful and whether there should
> be others. For instance, would it be worth to support something like
> properties (using wctype)? Or a capture that matches one code point
> and captures its value?
>
> -- Roberto
>


Right. Lua and LPeg already do what is expected with UTF-8, which is
to say that they don't break the naive byte-transparent handling that
is the defining feature of UTF-8. So the goal isn't so much UTF-8
support as Unicode support done in a platform-independent way, one
that includes those platforms whose system encoding happens to be
UTF-8. Since changing the existing functions would break applications
that expect the naive byte-stream approach, Unicode equivalents of the
current functions are the right approach.
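
To make the octet versus code point distinction concrete, here is a
minimal sketch in plain Lua (no LPeg needed). The name utf8_len is
made up for illustration and is not an existing Lua or LPeg function;
it also assumes the input is well-formed UTF-8.

```lua
-- Hypothetical helper: count code points in a well-formed UTF-8 string
-- by counting every byte that is NOT a continuation byte.
-- Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF).
local function utf8_len(s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 0x80 or b >= 0xC0 then
      n = n + 1
    end
  end
  return n
end

-- "café" written with decimal escapes: é (U+00E9) is the two octets C3 A9.
local s = "caf\195\169"

print(#s)           -- 5 (octets, what the byte-oriented functions see)
print(utf8_len(s))  -- 4 (code points, what a Unicode equivalent would count)
```

A byte-oriented pattern that consumes four units from this string
stops in the middle of the é, which is exactly the behaviour existing
applications may be relying on and new Unicode functions would avoid.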

Since LPeg is concerned with semantics rather than presentation, the
code point is the right unit for captures and counting. Given that Lua
numbers are (by default) double-precision floats, which can hold any
code point exactly, using UCS-4 for the internal representation is
probably "the Lua thing to do"... and it would save Lua-l from the
"should we migrate from UCS-2 to UCS-4" discussion at some point in
the future. ;)
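
As a rough sketch of what "code points as Lua numbers" could look
like, the following decodes a UTF-8 string into a table of numbers,
one per code point. decode_utf8 is a hypothetical name, not part of
Lua or LPeg, and the sketch is deliberately simplified: it checks
lead and continuation bytes but does not reject overlong forms or
encoded surrogates.

```lua
-- Hypothetical helper: decode UTF-8 into an array of code point numbers.
-- Simplified: does not reject overlong encodings or surrogate values.
local function decode_utf8(s)
  local out, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local cp, extra
    if b < 0x80 then
      cp, extra = b, 0            -- 0xxxxxxx: ASCII
    elseif b >= 0xC2 and b < 0xE0 then
      cp, extra = b - 0xC0, 1     -- 110xxxxx: one continuation byte
    elseif b >= 0xE0 and b < 0xF0 then
      cp, extra = b - 0xE0, 2     -- 1110xxxx: two continuation bytes
    elseif b >= 0xF0 and b < 0xF5 then
      cp, extra = b - 0xF0, 3     -- 11110xxx: three continuation bytes
    else
      error("invalid lead byte at position " .. i)
    end
    for j = i + 1, i + extra do
      local c = s:byte(j)
      if not c or c < 0x80 or c > 0xBF then
        error("invalid continuation byte at position " .. j)
      end
      cp = cp * 64 + (c - 0x80)   -- shift in six payload bits
    end
    out[#out + 1] = cp
    i = i + extra + 1
  end
  return out
end

-- "A", the euro sign U+20AC, and U+1D11E (outside the BMP).
local cps = decode_utf8("A\226\130\172\240\157\132\158")
for _, cp in ipairs(cps) do
  print(string.format("U+%04X", cp))
end
-- U+0041
-- U+20AC
-- U+1D11E
```

Every value here, including the one outside the BMP, is an ordinary
Lua number, so no special character type is needed.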

The best approach for the community would probably be a Unicode
library that provides a string API, imports the basic UTF encodings,
and provides classes that store and manipulate Unicode text as either
UCS-2 or UCS-4. Java, Python and .NET manage to get along just fine
with 16-bit units (strictly speaking UTF-16 rather than UCS-2, since
they use surrogate pairs), which takes half the memory of UCS-4 for
most applications but involves a bit of ugly for characters outside
the Basic Multilingual Plane. That's an implementation detail, though.
This year memory is cheap and we aren't concerned with Unicode on too
many 16-bit processors, so my preference leans towards UCS-4, but I
think it mostly comes down to how the implementor wants to handle
characters outside the BMP. Either way, if the LPeg Unicode functions
relied on a Unicode string API, then that API would likely be robust
enough for most other Unicode string applications. By the same token,
community-supplied implementations of the same API should work with
LPeg when someone gets the itch for an implementation that is better
optimized for their application.
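
For anyone wondering what the "bit of ugly" outside the BMP amounts
to, here is a small sketch of the standard surrogate-pair arithmetic
that 16-bit representations need. The function name utf16_units is
made up for this example, and it assumes its argument is a valid code
point.

```lua
-- Hypothetical helper: return the 16-bit code units needed for one
-- code point. BMP code points take one unit; anything above U+FFFF
-- is split into a high and a low surrogate.
local function utf16_units(cp)
  if cp < 0x10000 then
    return { cp }
  end
  local v = cp - 0x10000                 -- 20 bits remain
  local high = 0xD800 + math.floor(v / 0x400)  -- top 10 bits
  local low  = 0xDC00 + v % 0x400              -- bottom 10 bits
  return { high, low }
end

local bmp = utf16_units(0x20AC)    -- euro sign, inside the BMP
local astral = utf16_units(0x1D11E) -- musical G clef, outside the BMP

print(#bmp, string.format("%04X", bmp[1]))
-- 1   20AC
print(#astral, string.format("%04X %04X", astral[1], astral[2]))
-- 2   D834 DD1E
```

With UCS-4 the second case is a single unit like any other, which is
the whole attraction; the cost is the extra memory for text that never
leaves the BMP.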

And why I say "string API":

http://unicode.org/faq/utf_bom.html#utf32-4

Unless you want to get bogged down in formal details that are not
likely to be relevant to a working grammar, "c" is a string rather
than a character and it generates a match on a word containing "ch"
even in a Czech or Spanish locale.
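
A tiny illustration in plain Lua, using only the standard string
library. The word is just an example; "ch" is treated as a single
letter in the Czech alphabet, yet a literal search for "c" still
matches inside it.

```lua
local word = "chata"   -- Czech word beginning with the digraph "ch"

-- Plain (non-pattern) search for the one-character string "c".
local first, last = word:find("c", 1, true)

print(first, last)  -- 1   1
```

Nothing in the string itself says that "c" and "h" belong together
here; that knowledge lives in locale data, not in the code points.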

Chris
--
Yippee-ki-yay, coffee maker.