* Roberto Ierusalimschy:

>> Just wondering, what is on the roadmap for LPEG?  Any ideas as to when
>> it would be out?
>
> I must confess I am currently stuck. I think LPEG should support Unicode
> (through UTF-8), but I have no idea what "to support Unicode" means :)

P(1) needs to turn into

  R("\000\127") +
  R("\194\223") * R("\128\191") +
  R("\224\239") * R("\128\191") * R("\128\191") +
  R("\240\244") * R("\128\191") * R("\128\191") * R("\128\191")

or something similar (it's more difficult if you want to rule out
invalid UTF-8 sequences such as overlong encodings or surrogates).
Simplifications are possible if you don't care about illegal input
(but it makes sense to follow what Markus Kuhn's test file does with
them: <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>).
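
A rough sketch of the stricter variant, not a definitive implementation: the
lead-byte ranges are split so that overlong encodings, surrogates
(U+D800..U+DFFF), and code points above U+10FFFF are rejected. It assumes
the `lpeg` module is installed; the names `cont`, `utf8char`, and
`count_chars` are made up for illustration.

```lua
local lpeg = require "lpeg"
local P, R, C, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Ct

local cont = R("\128\191")  -- continuation byte, 10xxxxxx

-- One well-formed UTF-8 character; each line restricts the byte after
-- the lead byte where a looser range would admit illegal sequences.
local utf8char =
    R("\000\127")                                -- U+0000..U+007F
  + R("\194\223") * cont                         -- U+0080..U+07FF
  + P("\224") * R("\160\191") * cont             -- U+0800..U+0FFF (no overlongs)
  + R("\225\236") * cont * cont                  -- U+1000..U+CFFF
  + P("\237") * R("\128\159") * cont             -- U+D000..U+D7FF (no surrogates)
  + R("\238\239") * cont * cont                  -- U+E000..U+FFFF
  + P("\240") * R("\144\191") * cont * cont      -- U+10000..U+3FFFF (no overlongs)
  + R("\241\243") * cont * cont * cont           -- U+40000..U+FFFFF
  + P("\244") * R("\128\143") * cont * cont      -- U+100000..U+10FFFF

-- Capture every character; fail unless the whole subject is consumed.
local whole = Ct(C(utf8char)^0) * -P(1)

-- Returns the number of characters, or nil if s is not valid UTF-8.
local function count_chars(s)
  local t = whole:match(s)
  return t and #t or nil
end
```

With this, `count_chars("\195\167")` counts one character, while an overlong
sequence such as `"\192\128"` makes the whole match fail.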

However, this is just very basic support.  P(1) could also match
anything which fits into one terminal cell (or two, for double-width
characters), for instance LATIN SMALL LETTER C followed by COMBINING
CEDILLA.
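
To make the combining-character case concrete, here is a minimal sketch that
groups a character with any following marks from the Combining Diacritical
Marks block (U+0300..U+036F) only. That restriction is an assumption made
to keep the example small: real grapheme clusters and double-width characters
need Unicode property tables. It assumes the `lpeg` module is installed, and
the names `char`, `combining`, and `split_clusters` are hypothetical.

```lua
local lpeg = require "lpeg"
local P, R, C, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Ct

local cont = R("\128\191")  -- continuation byte

-- Loose UTF-8 character (no overlong or surrogate checks).
local char =
    R("\000\127")
  + R("\194\223") * cont
  + R("\224\239") * cont * cont
  + R("\240\244") * cont * cont * cont

-- U+0300..U+033F encodes as CC 80..CC BF, U+0340..U+036F as CD 80..CD AF.
local combining = P("\204") * R("\128\191") + P("\205") * R("\128\175")

-- A "cluster" here is one character plus any trailing combining marks.
local cluster = char * combining^0
local clusters = Ct(C(cluster)^0) * -P(1)

-- Returns a table of clusters, or nil if s cannot be split this way.
local function split_clusters(s)
  return clusters:match(s)
end

-- "c" followed by COMBINING CEDILLA (U+0327, bytes CC A7), then "a":
local t = split_clusters("c\204\167a")
-- t[1] is the three-byte string "c\204\167", t[2] is "a"
```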

The Unicode folks have some ideas, but I don't think many engines work
this way: <http://unicode.org/reports/tr18/>.  PCRE also has some
extended support that allows matching grapheme clusters.  Character
properties are widely supported, but the tables can be quite large.

> Other than that, there are a few details:

Have you considered adding ternary choice to the bytecode, based on
ternary search trees?  See: <http://www.cs.princeton.edu/~rs/strings/>
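
For readers unfamiliar with the data structure, a minimal ternary search
tree in plain Lua might look like the sketch below. Each node compares one
byte and branches three ways (lower, equal, higher), which is the "ternary
choice" in question. How this would map onto LPEG's bytecode is not
something this sketch attempts; the names `tst_insert` and `tst_lookup` and
the node layout are assumptions for illustration, and keys must be non-empty.

```lua
-- Node layout (hypothetical): { ch = byte, lo = node, eq = node, hi = node, value = any }

-- Insert key/value under node; returns the (possibly new) subtree root.
local function tst_insert(node, key, value, i)
  i = i or 1
  assert(i <= #key, "keys must be non-empty")
  local b = key:byte(i)
  if not node then node = { ch = b } end
  if b < node.ch then
    node.lo = tst_insert(node.lo, key, value, i)
  elseif b > node.ch then
    node.hi = tst_insert(node.hi, key, value, i)
  elseif i < #key then
    node.eq = tst_insert(node.eq, key, value, i + 1)  -- byte matched, advance
  else
    node.value = value                                -- last byte of the key
  end
  return node
end

-- Return the value stored for key, or nil if the key is absent.
local function tst_lookup(node, key)
  if key == "" then return nil end
  local i = 1
  while node do
    local b = key:byte(i)
    if b < node.ch then
      node = node.lo
    elseif b > node.ch then
      node = node.hi
    elseif i < #key then
      node = node.eq
      i = i + 1
    else
      return node.value
    end
  end
  return nil
end

local root
root = tst_insert(root, "if", 1)
root = tst_insert(root, "in", 2)
root = tst_insert(root, "int", 3)
```

Keys that share a prefix ("in", "int") share nodes along the `eq` links, so
a lookup examines each input byte at most a few times instead of trying each
alternative from the start.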