Re: LPEG-based relaxed parsing again

lua-l archive

Subject: Re: LPEG-based relaxed parsing again
From: Sean Conner <sean@...>
Date: Thu, 4 Sep 2014 03:49:44 -0400

It was thus said that the Great Paul K once stated:
> 
> The question is: how do I write the expression that takes zero or more
> repetitions of a pattern and (separately) captures all non-matching
> strings?

  I'm not sure if it's possible.  The following (using the re module from
LPeg 0.12) does kind of work:

	re = require "re"

	G = [[
	
	block           <- (do / expr / garbage)*
	--block         <- (do / expr )*
	
	do              <- {| DO block END |}
	expr            <- %s* { '(' DIGIT+ ')' }
	
	-- consume any one character, then require that what follows it
	-- is not 'do', 'end' or an expr
	garbage         <- . ! ('do' / 'end' / expr)
	
	DIGIT           <- [0-9]
	DO              <- %s* { 'do'  }
	END             <- %s* { 'end' }
	SP              <- %s*
	  
	]]
	
	expr = re.compile(G)

	-- dump() dumps the table.  Code for this not included

	dump("d",expr:match [[do(1)end]])
	dump("d",expr:match [[do (1) do (3) end (3) (4) end]])
	dump("d",expr:match [[do do end end]])
	dump("d",expr:match [[do 23 end]])
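
  The dump() helper called above was not posted.  A minimal stand-in,
assuming it only needs to print a name followed by a number, a string or a
nested array-style table, might look like the sketch below.  The names
format_value and dump are hypothetical, and the layout is close to, but not
identical to, the listings in this message.

```lua
-- Hypothetical stand-in for the dump() helper; the original code was not posted.
-- format_value() renders a value as text; tables are walked with ipairs(),
-- so only the array part (which is what the captures here produce) is shown.
local function format_value(value, indent)
  indent = indent or ""
  if type(value) ~= "table" then
    -- strings are quoted, everything else goes through tostring()
    return type(value) == "string" and string.format("%q", value) or tostring(value)
  end
  local lines = { "{" }
  for i, v in ipairs(value) do
    lines[#lines + 1] = string.format("%s  [%d] = %s,", indent, i,
                                      format_value(v, indent .. "  "))
  end
  lines[#lines + 1] = indent .. "}"
  return table.concat(lines, "\n")
end

-- Print "name = value", mirroring the dump("d", ...) calls above.
local function dump(name, value)
  print(name .. " = " .. format_value(value))
end
```

  Returning the text from format_value() rather than printing directly keeps
the helper easy to check in isolation.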

d = 2.000000,

d =
{
  [1] = "do",
  [2] = "(1)",
  [3] =
  {
    [1] = "do",
    [2] = "(3)",
    [3] = "end",
  },
  [4] = "(3)",
  [5] = "(4)",
  [6] = "end",
}

d =
{
  [1] = "do",
  [2] =
  {
    [1] = "do",
    [2] = "end",
  },
  [3] = "end",
}

d =
{
  [1] = "do",
  [2] = "end",
}

  So the first line (with no spaces) doesn't parse (the match just returns
position 2, with no captures), but the grammar does ignore the garbage input
(last example).  If you switch to the commented-out definition of 'block',
you get:

d =
{
  [1] = "do",
  [2] = "(1)",
  [3] = "end",
}

d =
{
  [1] = "do",
  [2] = "(1)",
  [3] =
  {
    [1] = "do",
    [2] = "(3)",
    [3] = "end",
  },
  [4] = "(3)",
  [5] = "(4)",
  [6] = "end",
}

d =
{
  [1] = "do",
  [2] =
  {
    [1] = "do",
    [2] = "end",
  },
  [3] = "end",
}

d = 1.000000,

  I haven't been able to get both the first and the last example to parse
correctly with a single grammar, and I've tried just about every combination
I can think of (for example, "garbage <- [^%s]+") to no avail.  Personally
(and in my not-so-humble opinion), trying to carry on in the presence of
garbage can only lead to trouble (witness the horrors of parsing arbitrary
HTML).
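
  One variation that might be worth trying, offered as a sketch rather than
a confirmed fix: in the grammar above, 'garbage' consumes a character first
and only then looks ahead, so in "do(1)end" it can eat the "e" of "end".
Moving the negative lookahead in front of the dot, and reusing the DO and
END rules inside it so leading whitespace is skipped, avoids that.  The rule
names are the ones from the grammar above; the variable names (parser,
first, last) are made up for illustration.

```lua
-- Sketch only: a variation on the grammar above, traced by hand, not a
-- fix taken from the thread.  Assumes LPeg's re module is installed.
local re = require "re"

local parser = re.compile [[
  block    <- (do / expr / garbage)*
  do       <- {| DO block END |}
  expr     <- %s* { '(' DIGIT+ ')' }

  -- changed rule: look ahead first, then consume one character
  garbage  <- !(DO / END / expr) .

  DIGIT    <- [0-9]
  DO       <- %s* { 'do'  }
  END      <- %s* { 'end' }
]]

-- The two inputs that could not both be handled before.
local first = parser:match "do(1)end"
local last  = parser:match "do 23 end"
```

  Traced by hand, 'first' should come out as { "do", "(1)", "end" } and
'last' as { "do", "end" }.  Like the original, this matches 'do' and 'end'
without any word-boundary check, so a word such as "done" would still be
taken for a keyword.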

  -spc (What's the saying? ... garbage in, garbage out?)
