- Subject: lpeg.cut? (Re: Elegant design for creating error messages in LPEG parser)
- From: nobody <nobody+lua-list@...>
- Date: Thu, 4 Apr 2019 00:28:20 +0200
On 03/04/2019 20.56, Sean Conner wrote:
Any failed pattern would return nil,position, but as long as there are
alternatives, it won't matter. But on a failure, at least you get the
offset into the string being parsed where the error was.
That's about the minimum I could see for LPEG. How about it Roberto?
Would that just be to record the line number as you go along using Carg(1) and
just print out the line where your parser fails?
I feel like the minimum viable error would handle unknown unknowns without
being completely useless like nil, while keeping
parser code simple. (I do not want to end up in a situation where the code
is 50% error handling.)
For the majority of my LPEG programs, I've been able to get away with
parsed vs. failed, as there wasn't much I could do about a failed parse
(especially when parsing SIP messages---log the failure, drop it and move on
to the next message). But having a position of failure would be nice.
So far, almost every single time I used LPEG, I spent upwards of an hour
(sometimes 6+ hours) on debugging. If you have a grammar and "only"
have to translate it to LPEG, identifying a problem is usually
manageable. But if you're trying to incrementally reconstruct a grammar
from a bunch of known samples, this is really really painful. With very
large or otherwise hard to inspect files (binary etc.), if the 1234th
repetition of some structure has an extra field, the only way I know to
identify the problem is to do lots of match-time print()ing…
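
For illustration, that kind of match-time logging can be packaged into a small helper. This is only a sketch: `trace` and the `log` table are names made up here, not part of LPEG. It relies on `lpeg.P(function)`, which is run at match time even if the enclosing pattern fails later.

```lua
local lpeg = require "lpeg"

-- Hypothetical helper (not an LPEG function): wrap `patt` so that every
-- attempt and every success is appended to `log` together with the position.
local function trace(name, patt, log)
  local function note(event)
    -- A function pattern is called with (subject, position) at match time.
    -- Returning the position unchanged means "succeed, consume nothing".
    return lpeg.P(function(_, pos)
      log[#log + 1] = ("%s %s at %d"):format(event, name, pos)
      return pos
    end)
  end
  -- A failed attempt logs only "try"; a successful one also logs "ok".
  return note("try") * patt * note("ok")
end

local log = {}
local digit  = trace("digit",  lpeg.R("09"), log)
local letter = trace("letter", lpeg.R("az"), log)
local item   = digit + letter

-- Fails on the "!", and the log shows where each alternative was tried.
local result = (item^1 * -1):match("7a!")
for _, line in ipairs(log) do print(line) end
```

The last two log entries show both alternatives being tried at position 3 with neither succeeding, which is the kind of information that otherwise gets lost. The obvious downside is the volume of output on large inputs.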
As far as I can tell, part of the problem is that all branches are tried
recursively – i.e. match failures at any point are expected and don't
mean there's actually a problem, and so there's no hard information
available that could be printed after all branches failed.
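
One workaround with LPEG as it is today is to record the furthest position at which any token was attempted, in the spirit of the Carg(1) idea quoted above. A minimal sketch, assuming a state table is passed as the extra argument to match; `mark`, `line_of`, and the `furthest` field are hypothetical names, and the tiny grammar only mimics the '<entity ' case:

```lua
local lpeg = require "lpeg"
local P, S = lpeg.P, lpeg.S

-- Hypothetical helper: before trying `patt`, remember the largest position
-- seen so far in the table passed as match's first extra argument.
local function mark(patt)
  return lpeg.Cmt(lpeg.Carg(1), function(_, pos, state)
    if pos > state.furthest then state.furthest = pos end
    return pos  -- succeed without consuming input
  end) * patt
end

-- Hypothetical helper: turn a byte position into a 1-based line number.
local function line_of(s, pos)
  local _, newlines = s:sub(1, pos - 1):gsub("\n", "")
  return newlines + 1
end

local WS = S(" \t\n")
local entity  = mark(P"<entity")  * WS^1 * mark(P"id=")
local message = mark(P"<message") * WS^1 * mark(P"from=")
local tag = entity + message

local subject = "<entity ref=1>"
local state = { furthest = 0 }
local result = tag:match(subject, 1, state)

if not result then
  local found = subject:sub(state.furthest, state.furthest + 3)
  print(("failed near position %d (line %d): %q"):format(
    state.furthest, line_of(subject, state.furthest), found))
end
```

This is only a heuristic: the furthest position is where the deepest attempt started, which is often but not always where the real problem is.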
My observation is that very often there are points in the grammar /
pattern where trying alternatives is known to be useless. (If there was
a match for '<entity ' and now 'id=' is expected but 'ref=' is found, it
doesn't make sense to backtrack and try '<message ' etc. – they
certainly won't match – but LPEG doesn't know that.)
## proposal / question
Would it make sense to add a way to tell LPEG "do not backtrack past
this point" – e.g. by 'lpeg.cut( )'? (I'm taking the name from Prolog –
maybe there's a better name?) With cuts, there *would* be hard known
information that could be printed: The position in the input when LPEG
attempted to backtrack over the cut.
Going a step further, `lpeg.cut( [name] )` could (maybe) be used to
produce something like a stack trace? (I haven't looked at the LPEG
internals, don't know how hard/easy this would be.)
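
For comparison, something close to a cut can be approximated with LPEG as it exists, by raising a Lua error from a match-time function once backtracking is known to be pointless. This is a sketch of that workaround, not the proposed `lpeg.cut`: `expect` is a hypothetical helper, it aborts the whole match rather than recording a trace, and the grammar is reduced to two made-up tags.

```lua
local lpeg = require "lpeg"
local P, S = lpeg.P, lpeg.S

-- Hypothetical helper (not part of LPEG): try `patt`; if it fails, raise an
-- error carrying a name and the position instead of letting LPEG backtrack.
local function expect(name, patt)
  return patt + P(function(_, pos)
    error(("%s: no alternative matched at position %d"):format(name, pos), 0)
  end)
end

local WS = S(" \t\n")
-- After "<entity " has matched, only "id=" is acceptable.
local entity  = P"<entity"  * WS^1 * expect("tag:entity",  P"id=")
local message = P"<message" * WS^1 * expect("tag:message", P"from=")
local tag = entity + message

-- The error escapes lpeg.match, so the caller wraps the call in pcall.
local ok, err = pcall(lpeg.match, tag, "<entity ref=1>")
print(ok, err)
```

Input that never reaches an `expect` still backtracks normally, so `tag:match("<message from=x>")` tries `entity` first and then succeeds via `message`. What this does not give is the nested "in X at position N" chain described in this post; that would need extra bookkeeping.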
With cut, I could have
    entity = lpeg.P "<entity" * WS^1 * lpeg.cut( "tag:entity" ) * "id=" …
and then LPEG has enough information to tell me that in 'tag:entity' at
position 12345 (in 'tag:state' at position 123 in 'tag:savegame' at
position 10) no alternative matched, and by using the position I can
grab the next couple of lexemes (or bytes) from the file, and then I
know that there was 'ref=' instead of 'id=' and debugging would be *so*
much easier.
At least that's the dream… Would that actually work? And is this
sufficiently compatible with LPEG's internals? Or is that maybe