- Subject: Re: LPEG-based relaxed parsing again
- From: Sean Conner <sean@...>
- Date: Thu, 4 Sep 2014 03:49:44 -0400
It was thus said that the Great Paul K once stated:
>
> The question is: how do I write the expression that takes zero or more
> repetitions of a pattern and (separately) captures all non-matching
> strings?
I'm not sure if it's possible. The following (using the 0.12 re module)
does kind of work:
  re = require "re"

  G = [[
        block   <- (do / expr / garbage)*
        --block <- (do / expr )*
        do      <- {| DO block END |}
        expr    <- %s* { '(' DIGIT+ ')' }
        garbage <- . ! ('do' / 'end' / expr)
        DIGIT   <- [0-9]
        DO      <- %s* { 'do' }
        END     <- %s* { 'end' }
        SP      <- %s*
  ]]

  expr = re.compile(G)

  -- dump() dumps the table. Code for this not included
  dump("d",expr:match [[do(1)end]])
  dump("d",expr:match [[do (1) do (3) end (3) (4) end]])
  dump("d",expr:match [[do do end end]])
  dump("d",expr:match [[do 23 end]])
This prints:

  d = 2.000000,
  d =
  {
    [1] = "do",
    [2] = "(1)",
    [3] =
    {
      [1] = "do",
      [2] = "(3)",
      [3] = "end",
    },
    [4] = "(3)",
    [5] = "(4)",
    [6] = "end",
  }
  d =
  {
    [1] = "do",
    [2] =
    {
      [1] = "do",
      [2] = "end",
    },
    [3] = "end",
  }
  d =
  {
    [1] = "do",
    [2] = "end",
  }
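So the first example (the one with no spaces) doesn't parse: the match
returns only a position, 2, instead of a table. The grammar does, however,
skip the garbage input in the last example. If you switch to the
commented-out definition of 'block', you get: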
  d =
  {
    [1] = "do",
    [2] = "(1)",
    [3] = "end",
  }
  d =
  {
    [1] = "do",
    [2] = "(1)",
    [3] =
    {
      [1] = "do",
      [2] = "(3)",
      [3] = "end",
    },
    [4] = "(3)",
    [5] = "(4)",
    [6] = "end",
  }
  d =
  {
    [1] = "do",
    [2] =
    {
      [1] = "do",
      [2] = "end",
    },
    [3] = "end",
  }
  d = 1.000000,
I haven't been able to get both the first and the last example to parse
correctly with a single grammar, and I've tried just about every
combination I can think of (for example, "garbage <- [^%s]+") to no avail.
Personally (and in my not-so-humble opinion), trying to carry on in the
presence of garbage can only lead to trouble (witness the horrors of
parsing arbitrary HTML).
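
One arrangement that may still be worth trying is to put the negative
lookahead in front of the '.' instead of after it, so that a character only
counts as garbage when no 'do', 'end' or expr starts at that very position.
What follows is a sketch under that assumption, not something checked
against the 0.12 re module specifically. The grammar is the one above with
only the 'garbage' rule changed and the unused SP rule dropped; the names
'grammar', 't1' and 't2' are made up for illustration.

```lua
-- Sketch only: needs LPeg and the re module that ships with it.
local re = require "re"

-- Same rules as the grammar above, except that 'garbage' tests the
-- lookahead first and only then consumes a single character.
local grammar = re.compile [[
  block   <- (do / expr / garbage)*
  do      <- {| DO block END |}
  expr    <- %s* { '(' DIGIT+ ')' }
  garbage <- !(DO / END / expr) .
  DIGIT   <- [0-9]
  DO      <- %s* { 'do' }
  END     <- %s* { 'end' }
]]

-- The two inputs that could not both be parsed by one grammar above.
local t1 = grammar:match "do(1)end"
local t2 = grammar:match "do 23 end"

-- Both results are flat tables of strings, so table.concat can show them.
print(table.concat(t1, " "))
print(table.concat(t2, " "))
```

Note that this treats 'do' and 'end' as keywords even inside longer words
such as "done", and it throws the garbage away instead of capturing it, so
it only goes part of the way towards the original question.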
-spc (What's the saying? ... garbage in, garbage out?)