• Subject: Re: Argh! Regular expressions in Lua
• From: Gavin Wraith <gavin@...>
• Date: Wed, 22 Nov 2006 16:28:59 GMT

```In message <20061122135805.A32626@lua.tecgraf.puc-rio.br> you wrote:

> > is there a compelling reason for such limitation?
>
> Size and complexity. If you want to do complex pattern matching with
> full RE, then you can add it as an external libraries (and there are
> a few for Lua).

In the beginning of section 20.2 of "Programming in Lua" Roberto
explains the practical reasons for Lua having its own pattern
matching syntax.

Kleene's original work on regular expressions asserted that a set
of substrings of a string was detectable by a finite-state machine
if and only if it could be expressed in terms of single characters
using the following operations:

binary: alternation (x|y), concatenation (xy)
unary:  repetition (x*)
nullary: 0 (matches nothing), 1 (matches only the empty string)

Am I to understand that such an algebra presents too many costs,
in size or efficiency?

It seems to me that a practical Kleene-complete syntax for
pattern-matching needs to use at least three(!) kinds of parenthesis:

1) For disambiguating expressions e.g xy* or (xy)*.
[Lua's syntax gets round this by only allowing unary operations on
character-classes (i.e. patterns that denote 1-element substrings),
and not having a general alternation operation, so that ambiguities
do not appear.]

2) For denoting captures.

3) For demarcating variables that range over patterns.
[Lua does not allow variables in pattern expressions; instead one
must use string operations on patterns themselves,e.g.
extender = "%.%a%a%a" ]
Such a facility could be very convenient syntactic sugar.

Suppose for illustration that we use () for 1), <> for 2) and {} for 3).
Then if
x,y = "ab*c", "defg?"
the expression"{x}|{y}?" would be "(ab*c)|(defg?)?".
This can be expressed with strings as "("..x..")|("..y..")?",
which I claim is rather clumsier.

> Lua evolves by answering "why?" not "why not?".

I applaud the principle. I am trying to understand what it is in
the general algebra that causes the size and the complexity. Is it
a question of bounding the number of times that the string has
to be rescanned? In googling about I have come across pattern
matching packages that claim to be "compact and fast", e.g.
http://www.univ-rouen.fr/LIFAR/aia/ccp.html .
Maybe these epithets are relative.
--
Gavin Wraith (gavin@wra1th.plus.com)