lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


It was thus said that the Great Jonathan Goble once stated:
> On Mon, Oct 23, 2017 at 5:45 PM Sean Conner <sean@conman.org> wrote:
> > It was thus said that the Great Jonathan Goble once stated:
> > >
> > > How do you match a wildcard with repetition in a greedy, non-possessive
> > > manner with LPeg? AFAICT, it can't be done.
> >
> >   It can be done:
> >
> >         lpeg = require "lpeg"
> >         C    = lpeg.C
> >         P    = lpeg.P
> >
> >         lastdollar   = P"$" * (P(1) - P"$")^0 * P(-1)
> >         nolastdollar = (P(1) - lastdollar)^0
> >         parse        = P"?" * C(nolastdollar) * lastdollar
> >
> >   Yes, it's a bit more work to write that in LPeg, but I've shown it's
> > possible to do what you asked.
> 
> Great! So greedy, non-possessive repetition is possible. However, I still
> consider it a major flaw in LPeg that a simple behavior that is the default
> in most regex flavors, including Lua's heavily simplified regex flavor that
> we call pattern matching, requires a complicated workaround that isn't
> immediately obvious. Much of my objection to LPeg would be removed if it
> would receive an update with native syntax for greedy, non-possessive
> matching. But at least it's possible.

  I think you are thinking about LPeg with the wrong mindset---yes, you can
look for patters in text with LPeg [1] but it's for *parsing*---pulling out
semantic information from text, rather than just patterns.  I've written a
lot of LPeg code [2], and not once have I needed a greedy, non-possessive
repetition to parse text.

  I mean, the following pattern:

	"%a+%s+(%a+)%s+(%d+)%s+(%d+)%:(%d+)%:(%d+)%s+(%d+)"

will match:

	"Mon Oct 23 02:03:27 2017"

but it will also match:

	rotary wankle 99 123:987:33 90210
	urban legend 0 0:0:0 0
	AAAAAAAAAAAAAAAAA	BBBBBBBBBBBBBBBBBBBB	000000000000000	0000000000000:0000:0	999999999999999999999999

  Whereas the LPeg code I prsented will fail with those last three, as they
don't represent the semantic data (a date) that I'm expecting to parse. 
Stop trying to directly replicate Lua patterns with LPeg and instead think
of pulling out semantic information from text.

> >   Okay, then this should do it:
> >
> >         lpeg.B(lpeg.P(1) - set) * #set + #set
> >
> > About lpeg.B():
> >
> >         Returns a pattern that matches only if the input string at the
> >         current position is preceded by patt. Pattern patt must match only
> >         strings with some fixed length, and it cannot contain captures.
> >
> >         Like the and predicate, this pattern never consumes any input,
> >         independently of success or failure.
> >
> >   The bit with lpeg.B() does the check before the current postion, and that
> > the next character is in the set without consuming the input.  The "+ #set"
> > is for the case when we're at the start of the input and there is no input
> > prior to the frontier.
> 
> Hmm. I had completely missed the B function. This looks like it would work
> in the middle of the string. However, frontier patterns have interesting
> behaviors at either end of the subject string. Specifically, they will
> treat the string as if it is both prefixed and suffixed by a "\0". I'm not
> sure how this would be implemented. I think you would have to read the set
> to decide if "\0" is in the set or not, then write a P(-1) or P(1) as
> necessary as the set when a frontier pattern is at the beginning or end of
> a Lua pattern. However, this doesn't handle the case of a frontier pattern
> in the middle of a Lua string, preceded or followed only by repetition that
> can (but doesn't always) match nothing. This edge case seems to be the only
> thing left for LPeg to fully duplicate Lua patterns.

  I was able to replicate the following with LPeg:

	string.gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", print)

(from http://lua-users.org/wiki/FrontierPattern) but it's not pretty (I'm
implementing a straight translation here):

	local lpeg = require "lpeg"
	local R  = lpeg.R
	local P  = lpeg.P
	local Cs = lpeg.Cs

	local upper     = R"AZ"
	local alpha     = R("AZ","az")
	local non_alpha = P(1) - alpha

	local parse = Cs(
	        (
	          (
	            (
	                lpeg.B(non_alpha) -- [a]
	              + P(function(_,p) if p == 1 then return p end end)
	            )
	            * upper^1
		    * (#non_alpha + P(-1)) -- [b]
                  )
	          / function(c) print(c) return c end
	          + P(1)
	        )^0
	)
	
	local test    = "THE (QUICK) brOWN FOx JUMPS"
	local luapatt = "%f[%a]%u+%f[%A]"
	string.gsub(test,luapatt,print)
	print()
	parse:match(test)

[a]	Here we're checking the previous character is NOT an alphabetic
	character (A-Z a-z).  If this fails, then either the previous
	character WAS an alphabetic character, or we're at the start of the
	string.  The second bit (P(function() ...) checks to see if we're at
	the start of the input.

	Taken together, this ensures we're at the start of a string or the
	previous character wasn't an alphabetic character.  This will match
	"THE" and not "brOWN".

[b]	This makes sure the following character is NOT an alphabetic
	character, or we're at the end of the input.  This matches "THE",
	"QUICK" and "JUMPS" but excludes "FOx".

  I will say that the frontier pattern is not something I've needed, either
with Lua patterns or with LPeg.  I can see it being useful for some parsing
tasks (like pulling out upper case words in the input) but so far, I've been
able to skip this type of pattern entirely.

  But again, it *is* possible with LPeg.

  -spc

[1]	I used LPeg to parse words from "The Wizard of Oz" for a project
	back in 2014.  You can see the code here:

	https://github.com/spc476/NaNoGenMo-2014/blob/master/markov.lua#L45

	It wasn't just a simple pattern of a sequence of alpha characters,
	but that, plus words with an embedded apostrophe (like "'tis" or
	"ain't") or dashes (like "non-possessve") or even periods (like
	"Dr." or "Mr.").

[2]	https://github.com/spc476/NaNoGenMo-2014
	https://github.com/spc476/LPeg-talk
	https://github.com/spc476/NaNoGenMo-2015
	https://github.com/spc476/LPeg-Parsers

	Plus more for work, including code to parse SIP (starting with
	RFC-3261 and a slew of other RFCs).