- Subject: Re: Yieldable/streaming LPEG
- From: Sean Conner <sean@...>
- Date: Mon, 1 Sep 2014 03:55:39 -0400
It was thus said that the Great Paul K once stated:
> Hi Sean,
>
> > But I think that if you can break your parsing up into natural "units",
> > you should be able to parse a bit and if you fail to parse, add more data
> > until you can. As a proof-of-concept:
>
> This looks good; thank you for the PoC! I have something similar
> (although less elaborate, also as a PoC), but this doesn't solve one of
> the main problems I have: if there is no way to parse the input into
> natural "units", I'm stuck.
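   (For illustration, a minimal sketch of the buffer-and-retry idea quoted
above; this is not the actual PoC, and the newline-terminated "unit" is
just an assumed example:)

    local lpeg = require "lpeg"
    local P, C, Cp = lpeg.P, lpeg.C, lpeg.Cp

    -- One "unit" is a newline-terminated line; Cp() reports how far
    -- the match got, so the consumed prefix can be dropped.
    local unit = C((1 - P"\n")^1) * P"\n" * Cp()

    local buffer = ""
    local function feed(chunk)
      buffer = buffer .. chunk
      while true do
        local cap, pos = unit:match(buffer)
        if not cap then break end   -- incomplete unit: wait for more data
        print("parsed unit:", cap)
        buffer = buffer:sub(pos)    -- keep only the unparsed tail
      end
    end

    feed("one\ntw")     -- parses "one"; "tw" stays buffered
    feed("o\nthree\n")  -- parses "two", then "three"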
   Hmm ... I have often wished LPeg worked like this:

    lpeg.match(pattern, subject [, init])

    pattern - an LPeg expression
    subject - if a string, use it as the input (what we have now)
              if a function, call it repeatedly to get input
              (much like how load() works)
    init    - where in the input to start matching
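   (For reference, this is how load() pulls its input on demand: the reader
function is called repeatedly and returns chunks until it returns nil. A
tiny example:)

    local parts = { "return ", "1 + ", "2" }
    local i = 0
    local chunk = assert(load(function()
      i = i + 1
      return parts[i]   -- nil after the last part signals end of input
    end))
    print(chunk())      --> 3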
   Now, assuming that's the case (it's not; this is just me thinking out
loud), let's create some patterns:
    local lpeg = require "lpeg"
    local P    = lpeg.P

    local one    = P"one"
    local oneida = P"oneida"
    local ones   = P"ones"
    local star   = P(1)^0
    local word   = oneida + ones + one + star
    local words  = word * (P" " * word)^0
and use a function as the subject of words:match(). Suppose the function
first returns "one". We may or may not have a full match, but we can't
tell until we read some more data. Say the next read returns "s". The
input so far, "ones", no longer matches "oneida", so LPeg has to back up
and try the next alternative; that means LPeg *has* to keep the input it
has already read, because of the way ordered choice works (backtrack and
try the next pattern). Say the next bit the function returns is "e". Now
we're stuck trying to match "onese" ...
   It becomes clear that even if LPeg could "suck in" text to parse, in the
worst case it could end up buffering all of the input to satisfy a
pattern, which is probably why PUC-Rio designed LPeg the way they did.
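   (A quick demonstration of that backtracking with stock, string-based
LPeg; the capture just shows which alternative finally won:)

    local lpeg = require "lpeg"
    local P, C = lpeg.P, lpeg.C

    local word = C(P"oneida" + P"ones" + P"one")

    print(word:match("oneida"))  --> oneida
    print(word:match("ones"))    --> ones  (tried "oneida" first, backed up)
    print(word:match("onese"))   --> ones  (the trailing "e" is left over)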
> As an example, consider a Lua file that returns a large table (a few
> thousand lines of code); I can construct a pattern that allows me to
> parse an incomplete fragment (the one that doesn't include a closing
> "}"), but I don't see a way to continue that parsing and correctly
> parse the rest of the table *including the closing "}"*, which may now
> be available.
>
> More specifically; take two fragments: (1) "return {1, " and (2) "2,
> }". I can fake/assume "}" at the end of the first fragment, but I
> don't see a way to continue with the second fragment using the same
> (let's say, Lua) grammar. The only way I could find was to "fake"
> opening brackets as well by adding "return{" to the beginning of the
> second fragment and then inserting the result into the right place in
> the generated AST, but this looks a bit cumbersome and is likely to be
> error-prone. If I don't find a better way, I'll probably finish and
> test this option.
>
> I can't really add more data and then go back and re-parse from the
> beginning as it's not so much the lack of data as it is the lack of
> time to parse it. I want to limit the time spent on parsing to small
> chunks (up to 50ms) even though the entire text can be parsed in
> 500ms.
>
> I'm open to any other ideas/suggestions...
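   (A minimal sketch of the bracket-faking idea described in the quote
above, using plain load() instead of a full LPeg Lua grammar; the merge
step it glosses over is exactly the error-prone part mentioned there:)

    local frag1 = "return {1, "
    local frag2 = "2, }"

    -- Fake the closing bracket on the first fragment ...
    local chunk1 = assert(load(frag1 .. "}"))
    -- ... and fake "return {" at the start of the second.
    local chunk2 = assert(load("return {" .. frag2))

    -- Each fragment now parses on its own; the two partial tables
    -- still have to be merged back together afterwards.
    local t1, t2 = chunk1(), chunk2()
    print(t1[1], t2[1])   --> 1   2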
I thought of another way, but unfortunately do not have a
"proof-of-concept"---I'm just tossing this out there ...
   Write the parsing expressions such that at certain spots, you actually
call coroutine.yield(). This can be done in Cmt(), Cf(), or with "pattern
/ function". Then just run the expression in a coroutine. Placed
judiciously, it should let you parse in smaller time chunks. It's a hack,
but it does allow you to yield the parsing (you can't use debug.sethook(),
as LPeg is its own VM and thus runs alongside the Lua VM).
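   (A sketch of that idea: the yield point sits in a Cmt() that succeeds
without consuming input, and the resume loop is where a 50ms-style time
budget could go. One caveat: this assumes the LPeg build in use can yield
from inside a match-time capture; a stock LPeg may instead raise "attempt
to yield across a C-call boundary" here.)

    local lpeg = require "lpeg"
    local P, Cmt = lpeg.P, lpeg.Cmt

    -- Always succeeds, consumes nothing, and hands control back to
    -- whoever resumed the coroutine running the match.
    local yieldpoint = Cmt(P(true), function(subject, pos)
      coroutine.yield(pos)   -- report how far we have parsed
      return pos             -- succeed without consuming input
    end)

    local item = (1 - P",")^1
    local list = item * (P"," * yieldpoint * item)^0 * P(-1)

    local co = coroutine.create(function(s)
      return list:match(s)
    end)

    local ok, pos = coroutine.resume(co, "a,b,c,d")
    while ok and coroutine.status(co) ~= "dead" do
      -- other work (or a time-budget check) would go here
      print("yielded at position", pos)
      ok, pos = coroutine.resume(co)
    end
    print("final match position:", pos)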
-spc