lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


When I parse my very-big-files with LPEG, I currently do it in a loop,
like:

   pattern = ONE_STATEMENT
   pos = 1
   repeat
     pos = pattern:match(string, pos)
   until not pos

to cover all of string, which be very big (500MB).

I tried changing it to:

   pattern = ONE_STATEMENT^0
   final_pos = pattern:match(string, pos)

but this caused my program to consumed _much_ more memory (several GB
more).  Note that this is with Lua 5.1.5, which has demonstrated
pretty good GC behavior so far.

I suppose this is because LPEG is accumulating references to
substrings or captures, or something, but at first glance, it seems
this isn't really necessary -- since all captures in ONE_STATEMENT are
"finalized" each time around the loop (note there's no final anchor),
it doesn't really seem necessary for LPEG to retain any references to
anything.

[So... is this a possible future optimization in LPEG?]

Anyway, going beyond that, I started to wonder about the following
grammar:

   pattern = lpeg.P"begin" * ONE_STATEMENT^0 * lpeg.P"end"
   final_pos = pattern:match(string, pos)

In _this_ case, because of the final "end", LPEG _won't_ finalize
captures in ONE_STATEMENT each time around, it will accumulate them
until the end.  So the "huge memory use" I'm seeing would maybe be
unavoidable.

However I think that in scenarios like this it would often be
desirable to have those captures fire earlier to allow the LPEG
resources to be reclaimed, and handle a final mismatch by other means
(probably an error message and abort... :).

So I'm thinking:  How hard would it be to have some sort of "flush"
operator in LPEG, that basically says "act as if this pattern and all
enclosing patterns successfully matched insofar as required to reclaim
all memory used to store captures or whatever.

E.g. the above grammar would be changed to something like:

   pattern = lpeg.P"begin" * lpeg.flush(ONE_STATEMENT)^0 * lpeg.P"end"
   final_pos = pattern:match(string, pos)

which would essentially behave as if one had written:

   beg_pat = lpeg.P"begin"
   one_pat = ONE_STATEMENT
   end_pat = lpeg.P"end"
   pos = beg_pat:match (string, 1)
   while true do
     local new_pos = one_pat:match (string, pos)
     if new_pos then
       pos = new_pos
     else
       pos = end_pat:match (string, pos)
       break
     end
   end

you can see that even for this very simple grammar, the LPEG version
is much more concise... :]

Thanks,

-miles

-- 
"Yorton, Wressle, and Gospel Oak, the richness of your heritage is ended.
We shall not stop at you again; for Dr Beeching stops at nothing."