lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On Mon, Oct 23, 2017 at 8:47 PM Sean Conner <sean@conman.org> wrote:
It was thus said that the Great Jonathan Goble once stated:
> On Mon, Oct 23, 2017 at 5:45 PM Sean Conner <sean@conman.org> wrote:
> > It was thus said that the Great Jonathan Goble once stated:
> > >
> > > How do you match a wildcard with repetition in a greedy, non-possessive
> > > manner with LPeg? AFAICT, it can't be done.
> >
> >   It can be done:
> >
> >         lpeg = require "lpeg"
> >         C    = lpeg.C
> >         P    = lpeg.P
> >
> >         lastdollar   = P"$" * (P(1) - P"$")^0 * P(-1)
> >         nolastdollar = (P(1) - lastdollar)^0
> >         parse        = P"?" * C(nolastdollar) * lastdollar
> >
> >   Yes, it's a bit more work to write that in LPeg, but I've shown it's
> > possible to do what you asked.
>
> Great! So greedy, non-possessive repetition is possible. However, I still
> consider it a major flaw in LPeg that a simple behavior that is the default
> in most regex flavors, including Lua's heavily simplified regex flavor that
> we call pattern matching, requires a complicated workaround that isn't
> immediately obvious. Much of my objection to LPeg would be removed if it
> would receive an update with native syntax for greedy, non-possessive
> matching. But at least it's possible.

  I think you are thinking about LPeg with the wrong mindset---yes, you can
look for patters in text with LPeg [1] but it's for *parsing*---pulling out
semantic information from text, rather than just patterns.  I've written a
lot of LPeg code [2], and not once have I needed a greedy, non-possessive
repetition to parse text.

<snip>

  Whereas the LPeg code I prsented will fail with those last three, as they
don't represent the semantic data (a date) that I'm expecting to parse.
Stop trying to directly replicate Lua patterns with LPeg and instead think
of pulling out semantic information from text.
 
I was meaning to refer to the original comment to which I replied last night, kicking this debate off, about LPeg possibly replacing Lua patterns in the stock Lua distribution in the future. That LPeg can in fact do everything Lua patterns can do removes one of my two objections to that. The other (relative difficulty of learning) still stands.

> >   Okay, then this should do it:
> >
> >         lpeg.B(lpeg.P(1) - set) * #set + #set
> >
> > About lpeg.B():
> >
> >         Returns a pattern that matches only if the input string at the
> >         current position is preceded by patt. Pattern patt must match only
> >         strings with some fixed length, and it cannot contain captures.
> >
> >         Like the and predicate, this pattern never consumes any input,
> >         independently of success or failure.
> >
> >   The bit with lpeg.B() does the check before the current postion, and that
> > the next character is in the set without consuming the input.  The "+ #set"
> > is for the case when we're at the start of the input and there is no input
> > prior to the frontier.
>
> Hmm. I had completely missed the B function. This looks like it would work
> in the middle of the string. However, frontier patterns have interesting
> behaviors at either end of the subject string. Specifically, they will
> treat the string as if it is both prefixed and suffixed by a "\0". I'm not
> sure how this would be implemented. I think you would have to read the set
> to decide if "\0" is in the set or not, then write a P(-1) or P(1) as
> necessary as the set when a frontier pattern is at the beginning or end of
> a Lua pattern. However, this doesn't handle the case of a frontier pattern
> in the middle of a Lua string, preceded or followed only by repetition that
> can (but doesn't always) match nothing. This edge case seems to be the only
> thing left for LPeg to fully duplicate Lua patterns.

  I was able to replicate the following with LPeg:

        string.gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", print)

(from http://lua-users.org/wiki/FrontierPattern) but it's not pretty (I'm
implementing a straight translation here):

        local lpeg = require "lpeg"
        local R  = lpeg.R
        local P  = lpeg.P
        local Cs = lpeg.Cs

        local upper     = R"AZ"
        local alpha     = R("AZ","az")
        local non_alpha = P(1) - alpha

        local parse = Cs(
                (
                  (
                    (
                        lpeg.B(non_alpha) -- [a]
                      + P(function(_,p) if p == 1 then return p end end)
                    )
                    * upper^1
                    * (#non_alpha + P(-1)) -- [b]
                  )
                  / function(c) print(c) return c end
                  + P(1)
                )^0
        )

        local test    = "THE (QUICK) brOWN FOx JUMPS"
        local luapatt = "%f[%a]%u+%f[%A]"
        string.gsub(test,luapatt,print)
        print()
        parse:match(test)

[a]     Here we're checking the previous character is NOT an alphabetic
        character (A-Z a-z).  If this fails, then either the previous
        character WAS an alphabetic character, or we're at the start of the
        string.  The second bit (P(function() ...) checks to see if we're at
        the start of the input.

        Taken together, this ensures we're at the start of a string or the
        previous character wasn't an alphabetic character.  This will match
        "THE" and not "brOWN".

[b]     This makes sure the following character is NOT an alphabetic
        character, or we're at the end of the input.  This matches "THE",
        "QUICK" and "JUMPS" but excludes "FOx".

  I will say that the frontier pattern is not something I've needed, either
with Lua patterns or with LPeg.  I can see it being useful for some parsing
tasks (like pulling out upper case words in the input) but so far, I've been
able to skip this type of pattern entirely.

  But again, it *is* possible with LPeg.

Fair enough. And the the way frontier patterns use "\0" like that is probably a quirk of the implementation (hey, the string is stored with a terminating "\0", why not use it and treat the beginning the same way?), and wouldn't necessarily be the same if patterns were implemented differently. So I'm not too bothered by it, and if I do try the LPeg implementation of patterns, I might tweak frontier patterns in that regard.