lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On Mon, Oct 23, 2017 at 5:45 PM Sean Conner <sean@conman.org> wrote:
It was thus said that the Great Jonathan Goble once stated:
>
> Your patterns work because they match without needing to backtrack. Here's
> an example that cannot match without backtracking, based on my original
> faulty example: Suppose you need to find a question mark, then match
> everything from that to the last dollar sign in the subject string.
>
> The easy Lua pattern for that is "%?.*%$". The ".*" token will necessarily
> match the entire remainder of the subject string, leaving nothing for the
> "%$" token to attempt to match. The matching engine must, after matching
> everything, backtrack until it hits a dollar sign. In other words, we need
> repetition that is greedy (matches as much as possible) but not possessive
> (refuses to backtrack).
>
> The LPeg pattern `(P(1) - patt)^0` does not work because that construct is
> lazy (matches the minimum possible), and will thus find the first dollar
> sign rather than the last. `P(1)^0 * P("$")` does not work because
> repetition with the "^" operator is possessive and will consume the entire
> subject string, meaning that the match attempt will always fail (IIUC) when
> it hits the `P("$")` token because there is nothing left.
>
> How do you match a wildcard with repetition in a greedy, non-possessive
> manner with LPeg? AFAICT, it can't be done.

  It can be done:

        lpeg = require "lpeg"
        C    = lpeg.C
        P    = lpeg.P

        lastdollar   = P"$" * (P(1) - P"$")^0 * P(-1)
        nolastdollar = (P(1) - lastdollar)^0
        parse        = P"?" * C(nolastdollar) * lastdollar

        print(parse:match "?12345$")            -- 12345
        print(parse:match "?12345$12345")       -- 12345
        print(parse:match "?12345$12345$")      -- 12345$12345
        print(parse:match "?12345$12345$123")   -- 12345$12345
        print(parse:match "?$")                 -- ""
        print(parse:match "?$12345")            -- ""
        print(parse:match "?$$")                -- $
        print(parse:match "?$$12345")           -- $
        print(parse:match "12345$")             -- nil

  Now a rundown of each production.

        lastdollar

                Match a dollar sign:                    P"$"
                0 or more non-dollar characters:        * (P(1) - P"$")^0
                end of input:                           * P(-1)

        nolastdollar

                Hold on to your hats---any character:   (P(1)
                that isn't the last dollar sign in
                the input:                              - lastdollar)
                0 or more times:                        ^0

        parse

                Match a question mark:                  P"?"
                Capture nolastdollar:                   * C(nolastdollar)
                and a final dollar sign:                * lastdollar

  Yes, it's a bit more work to write that in LPeg, but I've shown it's
possible to do what you asked.

Great! So greedy, non-possessive repetition is possible. However, I still consider it a major flaw in LPeg that a simple behavior that is the default in most regex flavors, including Lua's heavily simplified regex flavor that we call pattern matching, requires a complicated workaround that isn't immediately obvious. Much of my objection to LPeg would be removed if it would receive an update with native syntax for greedy, non-possessive matching. But at least it's possible.
 
> > > >                 %f[set] Erm ... I think (P(1) - set)^0
> > > >
> > > >   I would have to play around with the %f[set] pattern, not being
> > > > terribly familiar with it, but I think the LPeg I have for it is
> > > > correct if I understand the documentation.
> > >
> > > I don't think so. From my reading of the LPeg docs, this would actually
> > > advance the match position, which %f[set] does not (it matches the empty
> > > string). Also, it seems to fail to test the previous character is not in
> > > the set, only testing the next character, and I'm not sure even that test
> > > is correct.
> >
> >   Then it might be:     (P(1) - set) * #set
> >
> >   This matches one character as long as it isn't in the set, followed by a
> > character in the set, but don't consume the character in the set.
>
> Closer, but not quite. This checks that the character at the current
> matching position is not in the set (and consumes that character) and that
> the next character is in the set. The frontier pattern actually should
> check that the character previous to the current position (i.e. the one
> just matched by the previous pattern token) is not in set and that the
> character at the current position is in the set, and this should all happen
> without consuming any characters from the subject.

  Okay, then this should do it:

        lpeg.B(lpeg.P(1) - set) * #set + #set

About lpeg.B():

        Returns a pattern that matches only if the input string at the
        current position is preceded by patt. Pattern patt must match only
        strings with some fixed length, and it cannot contain captures.

        Like the and predicate, this pattern never consumes any input,
        independently of success or failure.

  The bit with lpeg.B() does the check before the current postion, and that
the next character is in the set without consuming the input.  The "+ #set"
is for the case when we're at the start of the input and there is no input
prior to the frontier.

  -spc

Hmm. I had completely missed the B function. This looks like it would work in the middle of the string. However, frontier patterns have interesting behaviors at either end of the subject string. Specifically, they will treat the string as if it is both prefixed and suffixed by a "\0". I'm not sure how this would be implemented. I think you would have to read the set to decide if "\0" is in the set or not, then write a P(-1) or P(1) as necessary as the set when a frontier pattern is at the beginning or end of a Lua pattern. However, this doesn't handle the case of a frontier pattern in the middle of a Lua string, preceded or followed only by repetition that can (but doesn't always) match nothing. This edge case seems to be the only thing left for LPeg to fully duplicate Lua patterns.

Thanks for the help in figuring out LPeg. Now that I understand it a lot better, if I have time amidst my schoolwork (and that's a big if), I might make a deep dive into LPeg and experiment with writing a module that reimplements the entire Lua pattern system using LPeg.