Re: lpeg as a part of lua (was: An introduction to Lua)

On Mon, Oct 23, 2017 at 2:54 AM Sean Conner <sean@conman.org> wrote:

It was thus said that the Great Jonathan Goble once stated:
> On Sun, Oct 22, 2017 at 11:36 PM Sean Conner <sean@conman.org> wrote:
>
> <snip>
>
> Interesting analysis. One issue (some lines snipped)
>
> Pattern items:
> > x* lpeg.P"x"^0
> > x+ lpeg.P"x"^1
> > x? lpeg.P"x"^-1
> >
>
> The issue here is that all of these Lua pattern tokens are greedy, but not
> possessive; i.e. they will match the longest possible string, but then if
> the match subsequently fails, they will backtrack one character and try
> again, until either the match succeeds or they cannot backtrack any further
> because there are no more characters for it to give up.
>
> The corresponding LPeg pattern, on the other hand, is possessive, meaning
> that it will match as much as possible, but never backtrack. So if the
> pattern fails to match after that point, it will simply return failure,
> even if backtracking could have produced a successful match. This lack of
> backtracking is precisely why I raised the concern of LPeg not being able
> to do everything that standard patterns can do.
>
> The questions then are:
> 1. Can this limitation be worked around to create equivalent behavior?

What limitation?

lpeg = require "lpeg"

function test(reg,patt,test)
print(test:match(reg))
print(patt:match(test))
print()
end

test("x*",lpeg.C(lpeg.P"x"^0) ,"xxxxxxxxxxxxxxa")
test("x+",lpeg.C(lpeg.P"x"^1) ,"xxxxxxxxxxxxxxa")
test("x?",lpeg.C(lpeg.P"x"^-1),"xxxxxxxxxxxxxxa")

I added the lpeg.C() call to have LPeg return the same information as the
pattern will. LPeg does the match, and once the pattern matches, it
returns. There may be more data to parse, but if the LPeg pattern doesn't
parse it, it won't fail. Same as a pattern.

So what backtracking does Lua patterns do?

Your patterns work because they match without needing to backtrack. Here's an example that cannot match without backtracking, based on my original faulty example: Suppose you need to find a question mark, then match everything from that to the last dollar sign in the subject string.

The easy Lua pattern for that is "%?.*%$". The ".*" token will necessarily match the entire remainder of the subject string, leaving nothing for the "%$" token to attempt to match. The matching engine must, after matching everything, backtrack until it hits a dollar sign. In other words, we need repetition that is greedy (matches as much as possible) but not possessive (refuses to backtrack).

The LPeg pattern `(P(1) - patt)^0` does not work because that construct is lazy (matches the minimum possible), and will thus find the first dollar sign rather than the last. `P(1)^0 * P("$")` does not work because repetition with the "^" operator is possessive and will consume the entire subject string, meaning that the match attempt will always fail (IIUC) when it hits the `P("$")` token because there is nothing left.

How do you match a wildcard with repetition in a greedy, non-possessive manner with LPeg? AFAICT, it can't be done.

> (Come to think about it, my original example pattern shouldn't have used
> the "-" token; rather, I should have written something else that used * or
> + and required backtracking in order to produce a successful match. My
> brain is a bit fried after a long day today.)

Please do, I'd like to see one.

See above.

> > %n patt / "%n"
> > %b() Yes, see below ...
> > %f[set] Erm ... I think (P(1) - set)^0
> >
> > I would have to play around with the %f[set] pattern, not being terribly
> > familiar with it, but I think the LPeg I have for it is correct if I
> > understand the documentation.
> >
>
> I don't think so. From my reading of the LPeg docs, this would actually
> advance the match position, which %f[set] does not (it matches the empty
> string). Also, it seems to fail to test the previous character is not in
> the set, only testing the next character, and I'm not sure even that test
> is correct.

Then it might be: (P(1) - set) * #set

This matches one character as long as it isn't in the set, followed by a
character in the set, but don't consume the character in the set.

Closer, but not quite. This checks that the character at the current matching position is not in the set (and consumes that character) and that the next character is in the set. The frontier pattern actually should check that the character previous to the current position (i.e. the one just matched by the previous pattern token) is not in set and that the character at the current position is in the set, and this should all happen without consuming any characters from the subject.

The key is that this inherently requires a lookbehind, hence why I think only lpeg.Cmt plus a custom function can do it.

> Possibly this could be done with a custom function through lpeg.Cmt? (In
> fact, I think that might be the only way, since you have to look backwards
> at the previous character, which LPeg seems to be bad at doing on its own.)

I would have to see some patterns and test data to know for sure.

I would be interested to see such data, though I'm not sure where to start myself.