lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Hello! I am nearing the end of my first major Lua project but I have got stuck with Lpeg.

(I have also asked this question on SO but with no response so far. If you play the SO game and want to get some more points you could answer at http://stackoverflow.com/questions/21622316/parsing-a-tex-like-language-with-lpeg ...)

I have managed to produce one grammar which does what I want, but I have been beating my head against this one and not getting far. The idea is to parse a document which is a simplified form of TeX. I want to split a document into:

* Environments, which are \begin{cmd} and \end{cmd} pairs.

* Commands which can either take an argument like so: \foo{bar} or can be bare: \foo.

* Both environments and commands can have parameters like so:
  \command[color=green,background=blue]{content}.

Other stuff.

I also would like to keep track of line number information for error handling purposes. Here's what I have so far:

lpeg = require("lpeg")
lpeg.locale(lpeg)
-- Assume a lot of "X = lpeg.X" here.

-- Line number handling from http://lua-users.org/lists/lua-l/2011-05/msg00607.html
-- with additional print statements to check they are working.
local newline = P"\r"^-1 * "\n" / function (a) print("New"); end
local incrementline = Cg( Cb"linenum" )/ function ( a ) print("NL"); return a + 1 end , "linenum"
local setup = Cg ( Cc ( 1) , "linenum" )
nl = newline * incrementline
space = nl + lpeg.space

-- Taken from "Name-value lists" in http://www.inf.puc-rio.br/~roberto/lpeg/
local identifier = (R("AZ") + R("az") + P("_") + R("09"))^1
local sep = lpeg.S(",;") * space^0
local value = (1-lpeg.S(",;]"))^1
local pair = lpeg.Cg(C(identifier) * space ^0 * "=" * space ^0 * C(value)) * sep^-1
local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
local parameters = (P("[") * list * P("]")) ^-1

-- And the rest is mine

anything = C( (space^1 + (1-lpeg.S("\\{}")) )^1) * Cb("linenum") / function (a,b) return { text = a, line = b } end

begin_environment = P("\\begin") * Ct(parameters) * P("{") * Cg(identifier, "environment") * Cb("environment") * P("}") / function (a,b) return { params = a[1], environment = b } end
end_environment = P("\\end{") * Cg(identifier) * P("}")

texlike = lpeg.P{
  "document";
  document = setup * V("stuff") * -1,
stuff = Cg(V"environment" + anything + V"bracketed_stuff" + V"command_with" + V"command_without")^0,
  bracketed_stuff = P"{" * V"stuff" * P"}" / function (a) return a end,
command_with =((P("\\") * Cg(identifier) * Ct(parameters) * Ct(V"bracketed_stuff"))-P("\\end{")) / function (i,p,n) return { command = i, parameters = p, nodes = n } end, command_without = (( P("\\") * Cg(identifier) * Ct(parameters) )-P("\\end{")) / function (i,p) return { command = i, parameters = p } end, environment = Cg(begin_environment * Ct(V("stuff")) * end_environment) / function (b,stuff, e) return { b = b, stuff = stuff, e = e} end
}


It almost works!

> texlike:match("\\foo[one=two]thing\\bar")
{
  command = "foo",
  parameters = {
    {
      one = "two",
    },
  },
}
{
  line = 1,
  text = "thing",
}
{
  command = "bar",
  parameters = {
  },
}

But! First, I can't get the line number handling part to work at all. The function within incrementline is never fired.

I also can't quite work out how nested capture information is passed to handling functions (which is why I have scattered Cg, C and Ct semirandomly over the grammar). This means that only one item is returned from within a command_with:

> texlike:match("\\foo{text \\command moretext}")
{
  command = "foo",
  nodes = {
    {
      line = 1,
      text = "text ",
    },
  },
  parameters = {
  },
}

I would also love to be able to check that the environment start and ends match up but when I tried to do so, my back references from "begin" were not in scope by the time I got to "end". I don't know where to go from here.