[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Memory management in LPEG
- From: Miles Bader <miles@...>
- Date: Mon, 23 Apr 2012 09:28:06 +0900
When I parse my very-big-files with LPEG, I currently do it in a loop,
like:
pattern = ONE_STATEMENT
pos = 1
repeat
pos = pattern:match(string, pos)
until not pos
to cover all of string, which be very big (500MB).
I tried changing it to:
pattern = ONE_STATEMENT^0
final_pos = pattern:match(string, pos)
but this caused my program to consumed _much_ more memory (several GB
more). Note that this is with Lua 5.1.5, which has demonstrated
pretty good GC behavior so far.
I suppose this is because LPEG is accumulating references to
substrings or captures, or something, but at first glance, it seems
this isn't really necessary -- since all captures in ONE_STATEMENT are
"finalized" each time around the loop (note there's no final anchor),
it doesn't really seem necessary for LPEG to retain any references to
anything.
[So... is this a possible future optimization in LPEG?]
Anyway, going beyond that, I started to wonder about the following
grammar:
pattern = lpeg.P"begin" * ONE_STATEMENT^0 * lpeg.P"end"
final_pos = pattern:match(string, pos)
In _this_ case, because of the final "end", LPEG _won't_ finalize
captures in ONE_STATEMENT each time around, it will accumulate them
until the end. So the "huge memory use" I'm seeing would maybe be
unavoidable.
However I think that in scenarios like this it would often be
desirable to have those captures fire earlier to allow the LPEG
resources to be reclaimed, and handle a final mismatch by other means
(probably an error message and abort... :).
So I'm thinking: How hard would it be to have some sort of "flush"
operator in LPEG, that basically says "act as if this pattern and all
enclosing patterns successfully matched insofar as required to reclaim
all memory used to store captures or whatever.
E.g. the above grammar would be changed to something like:
pattern = lpeg.P"begin" * lpeg.flush(ONE_STATEMENT)^0 * lpeg.P"end"
final_pos = pattern:match(string, pos)
which would essentially behave as if one had written:
beg_pat = lpeg.P"begin"
one_pat = ONE_STATEMENT
end_pat = lpeg.P"end"
pos = beg_pat:match (string, 1)
while true do
local new_pos = one_pat:match (string, pos)
if new_pos then
pos = new_pos
else
pos = end_pat:match (string, pos)
break
end
end
you can see that even for this very simple grammar, the LPEG version
is much more concise... :]
Thanks,
-miles
--
"Yorton, Wressle, and Gospel Oak, the richness of your heritage is ended.
We shall not stop at you again; for Dr Beeching stops at nothing."