Re: LPEG - next version

lua-l archive


Subject: Re: LPEG - next version
From: Florian Weimer <fw@...>
Date: Thu, 11 Jun 2009 21:45:54 +0200

* Roberto Ierusalimschy:

>> Just wondering, what is on the roadmap for LPEG?  Any ideas as to when
>> it would be out?
>
> I must confess I am currently stuck. I think LPEG should support Unicode
> (through UTF-8), but I have no idea what "to support Unicode" means :)

P(1) needs to turn into

  R("\000\127") +
  R("\194\223") * R("\128\191") +
  R("\224\239") * R("\128\191") * R("\128\191") +
  R("\240\244") * R("\128\191") * R("\128\191") * R("\128\191")

or something similar (it's more difficult if you want to rule out
invalid UTF-8 sequences such as overlong encodings or surrogates).
Simplifications are possible if you don't care about illegal input
(but it makes sense to follow what Markus Kuhn's test file does with
them: <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>).
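To illustrate what ruling out overlong encodings and surrogates might look like, here is a rough sketch. It assumes the lpeg module is installed; the names cont, utf8char and valid are made up for this example and are not part of LPeg. It restricts the second byte after the lead bytes 0xE0, 0xED, 0xF0 and 0xF4, which is where the overlong, surrogate and out-of-range cases live.

```lua
-- Sketch only: assumes the lpeg module is installed and loadable.
local lpeg = require("lpeg")
local R, P = lpeg.R, lpeg.P

local cont = R("\128\191")                   -- continuation byte 0x80-0xBF

-- One well-formed UTF-8 encoded code point (hypothetical name).
local utf8char =
    R("\0\127")                               -- U+0000..U+007F
  + R("\194\223") * cont                      -- U+0080..U+07FF
  + P("\224") * R("\160\191") * cont          -- U+0800..U+0FFF, no overlongs
  + R("\225\236") * cont * cont               -- U+1000..U+CFFF
  + P("\237") * R("\128\159") * cont          -- U+D000..U+D7FF, no surrogates
  + R("\238\239") * cont * cont               -- U+E000..U+FFFF
  + P("\240") * R("\144\191") * cont * cont   -- U+10000..U+3FFFF, no overlongs
  + R("\241\243") * cont * cont * cont        -- U+40000..U+FFFFF
  + P("\244") * R("\128\143") * cont * cont   -- U+100000..U+10FFFF

-- Matches only if the whole subject is well-formed UTF-8.
local valid = utf8char^0 * -1
```

With this, valid:match(s) returns a position for well-formed input and nil otherwise, so it could serve as a validity check before applying a looser pattern.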

However, this is just very basic support.  P(1) could also match
anything which fits into one terminal cell (or two, for double-width
characters), for instance LATIN SMALL LETTER C followed by COMBINING
CEDILLA.
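As a very rough approximation of that idea, a base character could be grouped with any following combining marks. The sketch below assumes the lpeg module is installed and only knows about the Combining Diacritical Marks block (U+0300..U+036F); real grapheme clusters need much larger tables. The names combining, cluster and clusters are invented for this example.

```lua
-- Sketch only: assumes the lpeg module is installed and loadable.
local lpeg = require("lpeg")
local R, P, C, Ct = lpeg.R, lpeg.P, lpeg.C, lpeg.Ct

local cont = R("\128\191")
-- Loose UTF-8 code point, no validity checks beyond byte ranges.
local utf8char =
    R("\0\127")
  + R("\194\223") * cont
  + R("\224\239") * cont * cont
  + R("\240\244") * cont * cont * cont

-- U+0300..U+033F is 0xCC 0x80-0xBF, U+0340..U+036F is 0xCD 0x80-0xAF.
local combining = P("\204") * R("\128\191") + P("\205") * R("\128\175")

-- A base code point that is not itself a combining mark, plus its marks.
local cluster = (utf8char - combining) * combining^0

-- Split a string into a table of such clusters.
local clusters = Ct(C(cluster)^0)
```

For example, clusters:match("c\204\167a") splits LATIN SMALL LETTER C plus COMBINING CEDILLA (U+0327, bytes 0xCC 0xA7) off as one element, followed by "a".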

The Unicode folks have some ideas, but I don't think many engines work
this way: <http://unicode.org/reports/tr18/>.  PCRE also has some
extended support that allows matching grapheme clusters.  Character
properties are widely supported, but the tables can be quite large.

> Other than that, there are a few details:

Have you considered adding ternary choice to the bytecode, based on
ternary search trees?  See: <http://www.cs.princeton.edu/~rs/strings/>
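For readers unfamiliar with the data structure, here is a minimal ternary search tree in plain Lua. It only shows the three-way branch on each byte (smaller, equal, larger); it says nothing about how such a choice would be encoded in LPEG's bytecode. The functions insert and contains and the field names are made up for this sketch, and keys are assumed to be non-empty.

```lua
-- Each node holds one byte and three links: lo (smaller byte),
-- eq (next position in the key), hi (larger byte).
local function insert(node, key, i)
  i = i or 1
  local b = key:byte(i)                 -- key must be non-empty
  if not node then node = { byte = b } end
  if b < node.byte then
    node.lo = insert(node.lo, key, i)
  elseif b > node.byte then
    node.hi = insert(node.hi, key, i)
  elseif i < #key then
    node.eq = insert(node.eq, key, i + 1)
  else
    node.terminal = true                -- a key ends at this node
  end
  return node
end

local function contains(node, key)
  local i = 1
  while node do
    local b = key:byte(i)
    if not b then return false end      -- empty key
    if b < node.byte then
      node = node.lo
    elseif b > node.byte then
      node = node.hi
    elseif i == #key then
      return node.terminal == true
    else
      node, i = node.eq, i + 1
    end
  end
  return false
end

-- Build a small tree from a few keywords.
local root
for _, w in ipairs({ "and", "break", "do", "else", "end" }) do
  root = insert(root, w)
end
```

Each lookup step compares a single byte and follows one of three links, which is the kind of dispatch a ternary choice over literal alternatives would perform.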

Follow-Ups:
- Re: LPEG - next version, Miles Bader

References:
- LPEG - next version, Thomas Harning Jr.
- Re: LPEG - next version, Roberto Ierusalimschy
