Re: lpeg as a part of lua (was: An introduction to Lua)

On Mon, Oct 23, 2017 at 20:47 Sean Conner <sean@conman.org> wrote:

It was thus said that the Great Jonathan Goble once stated:
> On Mon, Oct 23, 2017 at 5:45 PM Sean Conner <sean@conman.org> wrote:
> > It was thus said that the Great Jonathan Goble once stated:
> > >
> > > How do you match a wildcard with repetition in a greedy, non-possessive
> > > manner with LPeg? AFAICT, it can't be done.
> >
> > It can be done:
> >
> > lpeg = require "lpeg"
> > C = lpeg.C
> > P = lpeg.P
> >
> > lastdollar = P"$" * (P(1) - P"$")^0 * P(-1)
> > nolastdollar = (P(1) - lastdollar)^0
> > parse = P"?" * C(nolastdollar) * lastdollar
> >
> > Yes, it's a bit more work to write that in LPeg, but I've shown it's
> > possible to do what you asked.
>
> Great! So greedy, non-possessive repetition is possible. However, I still
> consider it a major flaw in LPeg that a simple behavior that is the default
> in most regex flavors, including Lua's heavily simplified regex flavor that
> we call pattern matching, requires a complicated workaround that isn't
> immediately obvious. Much of my objection to LPeg would be removed if it
> would receive an update with native syntax for greedy, non-possessive
> matching. But at least it's possible.

I think you are thinking about LPeg with the wrong mindset---yes, you can
look for patters in text with LPeg [1] but it's for *parsing*---pulling out
semantic information from text, rather than just patterns. I've written a
lot of LPeg code [2], and not once have I needed a greedy, non-possessive
repetition to parse text.

I mean, the following pattern:

"%a+%s+(%a+)%s+(%d+)%s+(%d+)%:(%d+)%:(%d+)%s+(%d+)"

will match:

"Mon Oct 23 02:03:27 2017"

but it will also match:

rotary wankle 99 123:987:33 90210
urban legend 0 0:0:0 0
AAAAAAAAAAAAAAAAA BBBBBBBBBBBBBBBBBBBB 000000000000000 0000000000000:0000:0 999999999999999999999999

Whereas the LPeg code I prsented will fail with those last three, as they
don't represent the semantic data (a date) that I'm expecting to parse.
Stop trying to directly replicate Lua patterns with LPeg and instead think
of pulling out semantic information from text.

> > Okay, then this should do it:
> >
> > lpeg.B(lpeg.P(1) - set) * #set + #set
> >
> > About lpeg.B():
> >
> > Returns a pattern that matches only if the input string at the
> > current position is preceded by patt. Pattern patt must match only
> > strings with some fixed length, and it cannot contain captures.
> >
> > Like the and predicate, this pattern never consumes any input,
> > independently of success or failure.
> >
> > The bit with lpeg.B() does the check before the current postion, and that
> > the next character is in the set without consuming the input. The "+ #set"
> > is for the case when we're at the start of the input and there is no input
> > prior to the frontier.
>
> Hmm. I had completely missed the B function. This looks like it would work
> in the middle of the string. However, frontier patterns have interesting
> behaviors at either end of the subject string. Specifically, they will
> treat the string as if it is both prefixed and suffixed by a "\0". I'm not
> sure how this would be implemented. I think you would have to read the set
> to decide if "\0" is in the set or not, then write a P(-1) or P(1) as
> necessary as the set when a frontier pattern is at the beginning or end of
> a Lua pattern. However, this doesn't handle the case of a frontier pattern
> in the middle of a Lua string, preceded or followed only by repetition that
> can (but doesn't always) match nothing. This edge case seems to be the only
> thing left for LPeg to fully duplicate Lua patterns.

I was able to replicate the following with LPeg:

string.gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", print)

(from http://lua-users.org/wiki/FrontierPattern) but it's not pretty (I'm
implementing a straight translation here):

local lpeg = require "lpeg"
local R = lpeg.R
local P = lpeg.P
local Cs = lpeg.Cs

local upper = R"AZ"
local alpha = R("AZ","az")
local non_alpha = P(1) - alpha

local parse = Cs(
(
(
(
lpeg.B(non_alpha) -- [a]
+ P(function(_,p) if p == 1 then return p end end)
)
* upper^1
* (#non_alpha + P(-1)) -- [b]
)
/ function(c) print(c) return c end
+ P(1)
)^0
)

local test = "THE (QUICK) brOWN FOx JUMPS"
local luapatt = "%f[%a]%u+%f[%A]"
string.gsub(test,luapatt,print)
print()
parse:match(test)

[a] Here we're checking the previous character is NOT an alphabetic
character (A-Z a-z). If this fails, then either the previous
character WAS an alphabetic character, or we're at the start of the
string. The second bit (P(function() ...) checks to see if we're at
the start of the input.

Taken together, this ensures we're at the start of a string or the
previous character wasn't an alphabetic character. This will match
"THE" and not "brOWN".

[b] This makes sure the following character is NOT an alphabetic
character, or we're at the end of the input. This matches "THE",
"QUICK" and "JUMPS" but excludes "FOx".

I will say that the frontier pattern is not something I've needed, either
with Lua patterns or with LPeg. I can see it being useful for some parsing
tasks (like pulling out upper case words in the input) but so far, I've been
able to skip this type of pattern entirely.

But again, it *is* possible with LPeg.

-spc

[1] I used LPeg to parse words from "The Wizard of Oz" for a project
back in 2014. You can see the code here:

https://github.com/spc476/NaNoGenMo-2014/blob/master/markov.lua#L45

It wasn't just a simple pattern of a sequence of alpha characters,
but that, plus words with an embedded apostrophe (like "'tis" or
"ain't") or dashes (like "non-possessve") or even periods (like
"Dr." or "Mr.").

[2] https://github.com/spc476/NaNoGenMo-2014
https://github.com/spc476/LPeg-talk
https://github.com/spc476/NaNoGenMo-2015
https://github.com/spc476/LPeg-Parsers

Plus more for work, including code to parse SIP (starting with
RFC-3261 and a slew of other RFCs).