lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


It was thus said that the Great Jonathan Goble once stated:
> On Sun, Oct 22, 2017 at 11:36 PM Sean Conner <sean@conman.org> wrote:
> 
> <snip>
> 
> Interesting analysis. One issue (some lines snipped)
> 
>   Pattern items:
> >                 x*      lpeg.P"x"^0
> >                 x+      lpeg.P"x"^1
> >                 x?      lpeg.P"x"^-1
> >
> 
> The issue here is that all of these Lua pattern tokens are greedy, but not
> possessive; i.e. they will match the longest possible string, but then if
> the match subsequently fails, they will backtrack one character and try
> again, until either the match succeeds or they cannot backtrack any further
> because there are no more characters for it to give up.
> 
> The corresponding LPeg pattern, on the other hand, is possessive, meaning
> that it will match as much as possible, but never backtrack. So if the
> pattern fails to match after that point, it will simply return failure,
> even if backtracking could have produced a successful match. This lack of
> backtracking is precisely why I raised the concern of LPeg not being able
> to do everything that standard patterns can do.
> 
> The questions then are:
> 1. Can this limitation be worked around to create equivalent behavior?

  What limitation? 

	lpeg = require "lpeg"

	function test(reg,patt,test)
	  print(test:match(reg))
	  print(patt:match(test))
	  print()
	end

	test("x*",lpeg.C(lpeg.P"x"^0) ,"xxxxxxxxxxxxxxa")
	test("x+",lpeg.C(lpeg.P"x"^1) ,"xxxxxxxxxxxxxxa")
	test("x?",lpeg.C(lpeg.P"x"^-1),"xxxxxxxxxxxxxxa")

  I added the lpeg.C() call to have LPeg return the same information as the
pattern will.  LPeg does the match, and once the pattern matches, it
returns.  There may be more data to parse, but if the LPeg pattern doesn't
parse it, it won't fail.  Same as a pattern.

  So what backtracking does Lua patterns do?

> 2. If it can be worked around, how easy or complicated is it to do so?

  Yes.  I don't know?  What are you trying to parse?

  For an example, let's say you are wanting to parse a date, like:

	test = "Mon Oct 23 02:03:27 2017"

Okay, the Lua pattern for that might be:

	date_pattern = "%a+%s+(%a+)%s+(%d+)%s+(%d+)%:(%d+)%:(%d+)%s+(%d+)"
	
	month,day,hour,min,sec,year = test:match(date_pattern)

Yeah, you can do that in Lua:

	a = lpeg.R("AZ","az")
	d = lpeg.R"09"
	s = lpeg.S" \t\v\r\n"
	date_pattern = a^1         * s
	             * lpeg.C(a^1) * s
	             * lpeg.C(d^1) * s
	             * lpeg.C(d^1) * lpeg.P":"
	             * lpeg.C(d^1) * lpeg.P":"
	             * lpeg.C(d^1) * s
	             * lpeg.C(d^1)
	             
	month,day,hour,min,sec,year = date_pattern:match(test)

This is a literal translation of the Lua pattern into LPeg. But that's a
waste of LPeg if you ask me.  Not only is it longer, but it's not showing
off what you can actually *get* out of LPeg.  A much better example would
be:

	d = lpeg.R"09"
	s = lpeg.S" \t\v\r\n"
	
	function range(low,high)
	  return lpeg.Cmt(d^1,function(subject,position,capture)
	    local v = tostring(capture)
	    if capture >= low and capture <= high then
	      return position,v
	    end
	  end)
	end
	
	day   = lpeg.P"Sun" * lpeg.Cc(1) + lpeg.P"Mon" * lpeg.Cc(2)
	      + lpeg.P"Tue" * lpeg.Cc(3) + lpeg.P"Wed" * lpeg.Cc(4)
	      + lpeg.P"Thu" * lpeg.Cc(5) + lpeg.P"Fri" * lpeg.Cc(6)
	      + lpeg.P"Sat" * lpeg.Cc(7)
	      
	month = lpeg.P"Jan" * lpeg.Cc( 1) + lpeg.P"Feb" * lpeg.Cc( 2)
              + lpeg.P"Mar" * lpeg.Cc( 3) + lpeg.P"Apr" * lpeg.Cc( 4)
              + lpeg.P"May" * lpeg.Cc( 5) + lpeg.P"Jun" * lpeg.Cc( 6)
              + lpeg.P"Jul" * lpeg.Cc( 7) + lpeg.P"Aug" * lpeg.Cc( 8)
              + lpeg.P"Sep" * lpeg.Cc( 9) + lpeg.P"Oct" * lpeg.Cc(10)
              + lpeg.P"Nov" * lpeg.Cc(11) + lpeg.P"Dec" * lpeg.Cc(12)
              
	date_pattern = lpeg.Ct(
		  lpeg.Cg(day,"wday")         * s
		* lpeg.Cg(month,"month")      * s
		* lpeg.Cg(range(1,31),"day")  * s
		* lpeg.Cg(range(0,23),"hour") * lpeg.P":"
		* lpeg.Cg(range(0,59),"min")  * lpeg.P":"
		* lpeg.Cg(range(0,59),"sec")  * s
		* lpeg.Cg(range(1900,2200),"year")
	)
	
	time = date_pattern:match(test)
	
  Yes, it's even longer still, but what you get is *structured* data (mostly
validated---you still have to check the month and day to see if you ended up
with Feb 30th, but that's *still* more than you got with Lua patterns). 
*And* you can reuse date_pattern any time you need to parse a date like
that.  The date, in a table, usable for os.date().

  There are other ways of handling the day and month productions, but the
above is my preference.  For those that don't like typing:

	day = lpeg.P(false)
	for i,d in pairs { 'Sun', 'Mon' , 'Tue' , 'Wed' , 'Thu' , 'Fri' , 'Sat' } do
	  day  = day + lpeg.P(d) * lpeg.Cc(i)
	end

or

	day = lpeg.P(3) / {
		Sun = 1 , Mon = 2 , Tue = 3 , Wed = 4 , 
		Thu = 5 , Fri = 6 , Sat = 7
	}
	
although that doesn't give you a match failure, just nil for wday.  It just
really depends on how early you want to error out.

> (Come to think about it, my original example pattern shouldn't have used
> the "-" token; rather, I should have written something else that used * or
> + and required backtracking in order to produce a successful match. My
> brain is a bit fried after a long day today.)

  Please do, I'd like to see one.
  
> >                 %n      patt / "%n"
> >                 %b()    Yes, see below ...
> >                 %f[set] Erm ... I think (P(1) - set)^0
> >
> >   I would have to play around with the %f[set] pattern, not being terribly
> > familiar with it, but I think the LPeg I have for it is correct if I
> > understand the documentation.
> >
> 
> I don't think so. From my reading of the LPeg docs, this would actually
> advance the match position, which %f[set] does not (it matches the empty
> string). Also, it seems to fail to test the previous character is not in
> the set, only testing the next character, and I'm not sure even that test
> is correct.

  Then it might be:	(P(1) - set) * #set
  
  This matches one character as long as it isn't in the set, followed by a
character in the set, but don't consume the character in the set.

> Possibly this could be done with a custom function through lpeg.Cmt? (In
> fact, I think that might be the only way, since you have to look backwards
> at the previous character, which LPeg seems to be bad at doing on its own.)

  I would have to see some patterns and test data to know for sure.  
  
> >   And I content that the only reason LPeg looks difficult is that you
> > are not familiar with it.  I personally find it difficult to read Lua
> > patterns (and even regexs that tend to look like line noise to me) but
> > that's because I rarely use them, instead using LPeg.
> 
> Quite possible. I learned full regular expressions before I came to Lua, so
> I found Lua patterns and their regex-like syntax easy to understand. LPeg
> is so radically different from those that it's hard for me to make the huge
> adjustment necessary to grasp it.

  I never learned regex (it looks too much like line noise, making it, to
me, a write only language) and never did care much for it.

  -spc