lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


> I'm not sure if it's possible.  The following (using the 0.12 re module) does kind of work:

Thank you Sean, William, and Peng for your suggestions; I think I
managed to get the results I want, but the solutions requires some
(lengthy) explanation.

First of all, I'll summarize the suggestions and restate the problem
as I now better understand it.

Peng's suggestion was to exclude keywords from garbage: EatenChar =
lpeg.C(lpeg.P(1)) - keywords. It works in some cases, but it also
ignores keywords that are out of context; for example, it will not
parse "do end end".

Sean's solution was to define garbage as  ". ! ('do' / 'end' / expr)",
but it suffers from the same issue as it doesn't except "do" and "end"
out of the context.

William's suggestion was to start with V"Stat" + C((P(1)^0) -
V"Stat"); I tried something similar (adjusted for my grammar), but it
didn't work for things like "do (1) return end" (it was marking
"return" as garbage) and I couldn't understand why.

So I decided to figure out why this simple case is not working, hoping
that it will improve my understanding of LPEG logic. I found Patrick
Donnelly's trace debugger for LPEG and made some tweaks to it to print
properly indented patterns to make it more readable. The debugger has
helped me immensely, so I highly recommend it if you get stuck on your
patterns.

Here is the debugger code with my modifications for better
visualization; I'm including it here as I used the same logic for the
solution, so it's interesting to look at it:

local function lpeg_debug(grammar)
  local level = 0
  for k, p in pairs(grammar) do
    local enter = lpeg.Cmt(lpeg.P(true),
      function(s, p, ...) print((" "):rep(level).."+", k, p,
s:sub(p,p)); level = level + 1 return p end)
    local leave = lpeg.Cmt(lpeg.P(true),
      function(s, p, ...) level = level - 1; print(("
"):rep(level).."-", k, p) return p end) * (lpeg.P(1) - lpeg.P(1))
    if k ~= 1 then
      grammar[k] = lpeg.Cmt(enter * p + leave, function(s, p, ...)
level = level - 1; print((" "):rep(level).."=", k, p, s:sub(1, p-1))
return p end)
    end
  end
  return grammar
end

The main component of the debugger is to wrap each pattern "p" into
lpeg.Cmt(enter * p + leave, ...), which will provide debugging for
entering, leaving, and matching the pattern.

Here is the output of this debugger on "do (1)(2) return end":

+ Chunk 1 d
 + Block 1 d
  + Stat 1 d
   + ExprStat 1 d
   - ExprStat 1
   + DoStat 1 d
    + Block 4 (
     + Stat 4 (
      + ExprStat 4 (
      = ExprStat 7 do (1)
     = Stat 7 do (1)
     + Stat 7 (
      + ExprStat 7 (
      = ExprStat 11 do (1)(2)
     = Stat 11 do (1)(2)
     + Stat 11 r
      + ExprStat 11 r
      - ExprStat 11
      + DoStat 11 r
      - DoStat 11
     - Stat 11

This is where it tries to match "r" from "return", but because it
doesn't match any of the "Stat" alternatives, it marks it as garbage,
but it's not garbage as it's actually part of the DoStat block that is
higher in the stack, so if I can make it to fail here and backtrack,
it may succeed on checking "return" where it needs to be.

I initially added "level" to the debugger for better visualization of
the pattern processing, but after looking at the results, I realized
that I can probably use the same logic for garbage processing; it just
needs to be applied when the level is 1 (and ignore keywords if it's
not 1).

Here is the new logic for the unknown() handler:

local keywords = lpeg.P("return") + "end" + "else" + "elseif" -- limited set
local function unknown(p)
  local level = 0
  local enter = lpeg.Cmt(lpeg.P(true), function(s, p) level = level +
1 return p end)
  local leave = lpeg.Cmt(lpeg.P(true), function(s, p) level = level -
1 return p end)
  * (lpeg.P(1) - lpeg.P(1))
  return lpeg.Cmt(enter * p
    + lpeg.Cmt(lpeg.C(lpeg.P(1)), function(s, p, ...) return level ==
1 or not lpeg.match(keywords,s,p-1) end)
    + leave, function(s, p, ...) level = level - 1 return p end)
end

It looks a bit convoluted, but it's actually exactly the same logic as
is used in the debugger, plus "lpeg.Cmt(lpeg.C(lpeg.P(1)), function(s,
p, ...) return level == 1 or not lpeg.match(keywords,s,p-1) end)",
which "eats" one character unless it's a keywords matched on higher
than the first level. I resorted to using lpeg.match a I couldn't
figure out how to incorporate the check into the pattern itself.

So far I haven't been able to find a fragment that can't be parsed by
this logic and it's been handling the well-formatted strings
correctly.

> I haven't been able to get both the first and the last example to parse
> correctly, and I've tried just about every combination (for example,
> "garbage <- [^%s]+") I can think of to no avail.

I tested on your test cases and it seems to correctly handle all four
of them. If you find issues with this solution, please let me know.
Since this is all new to me, I may be completely wrong with my
explanations.

> Personally (and in my
> not-so-humble opinion) trying to carry on in the presence of garbage can
> only lead to trouble (witness the horrors of parsing arbitrary HTML).

I agree, but this is mostly for the purpose of code editing in the
IDE, which has to deal with incremental changes to incomplete and
incorrect code fragments as much as possible.

I'm not including the complete example as the post is already too
long, but will put it somewhere online if someone is interested to see
it.

Paul.