lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


It was thus said that the Great Jonathan Goble once stated:
> 
> Your patterns work because they match without needing to backtrack. Here's
> an example that cannot match without backtracking, based on my original
> faulty example: Suppose you need to find a question mark, then match
> everything from that to the last dollar sign in the subject string.
> 
> The easy Lua pattern for that is "%?.*%$". The ".*" token will necessarily
> match the entire remainder of the subject string, leaving nothing for the
> "%$" token to attempt to match. The matching engine must, after matching
> everything, backtrack until it hits a dollar sign. In other words, we need
> repetition that is greedy (matches as much as possible) but not possessive
> (refuses to backtrack).
> 
> The LPeg pattern `(P(1) - patt)^0` does not work because that construct is
> lazy (matches the minimum possible), and will thus find the first dollar
> sign rather than the last. `P(1)^0 * P("$")` does not work because
> repetition with the "^" operator is possessive and will consume the entire
> subject string, meaning that the match attempt will always fail (IIUC) when
> it hits the `P("$")` token because there is nothing left.
> 
> How do you match a wildcard with repetition in a greedy, non-possessive
> manner with LPeg? AFAICT, it can't be done.

  It can be done:

	lpeg = require "lpeg"
	C    = lpeg.C
	P    = lpeg.P
	
	lastdollar   = P"$" * (P(1) - P"$")^0 * P(-1)
	nolastdollar = (P(1) - lastdollar)^0
	parse        = P"?" * C(nolastdollar) * lastdollar
	
	print(parse:match "?12345$")            -- 12345
	print(parse:match "?12345$12345")       -- 12345
	print(parse:match "?12345$12345$")      -- 12345$12345
	print(parse:match "?12345$12345$123")   -- 12345$12345
	print(parse:match "?$")                 -- ""
	print(parse:match "?$12345")            -- ""
	print(parse:match "?$$")                -- $
	print(parse:match "?$$12345")           -- $
	print(parse:match "12345$")             -- nil

  Now a rundown of each production.

	lastdollar

		Match a dollar sign:			P"$"
		0 or more non-dollar characters:	* (P(1) - P"$")^0
		end of input:				* P(-1)

	nolastdollar

		Hold on to your hats---any character:	(P(1)
		that isn't the last dollar sign in
		the input:				- lastdollar)
		0 or more times:			^0

	parse

		Match a question mark:			P"?"
		Capture nolastdollar:			* C(nolastdollar)
		and a final dollar sign:		* lastdollar

  Yes, it's a bit more work to write that in LPeg, but I've shown it's
possible to do what you asked.

> > > >                 %f[set] Erm ... I think (P(1) - set)^0
> > > >
> > > >   I would have to play around with the %f[set] pattern, not being
> > > > terribly familiar with it, but I think the LPeg I have for it is
> > > > correct if I understand the documentation.
> > >
> > > I don't think so. From my reading of the LPeg docs, this would actually
> > > advance the match position, which %f[set] does not (it matches the empty
> > > string). Also, it seems to fail to test the previous character is not in
> > > the set, only testing the next character, and I'm not sure even that test
> > > is correct.
> >
> >   Then it might be:     (P(1) - set) * #set
> >
> >   This matches one character as long as it isn't in the set, followed by a
> > character in the set, but don't consume the character in the set.
> 
> Closer, but not quite. This checks that the character at the current
> matching position is not in the set (and consumes that character) and that
> the next character is in the set. The frontier pattern actually should
> check that the character previous to the current position (i.e. the one
> just matched by the previous pattern token) is not in set and that the
> character at the current position is in the set, and this should all happen
> without consuming any characters from the subject.

  Okay, then this should do it:

	lpeg.B(lpeg.P(1) - set) * #set + #set

About lpeg.B():

	Returns a pattern that matches only if the input string at the
	current position is preceded by patt. Pattern patt must match only
	strings with some fixed length, and it cannot contain captures.

	Like the and predicate, this pattern never consumes any input,
	independently of success or failure.

  The bit with lpeg.B() does the check before the current postion, and that
the next character is in the set without consuming the input.  The "+ #set"
is for the case when we're at the start of the input and there is no input
prior to the frontier.

  -spc