- Subject: Re: [LPeg] How can I parse a subset of markdown?
- From: Sean Conner <sean@...>
- Date: Fri, 22 Jul 2016 19:13:30 -0400
It was thus said that the Great Soni L. once stated:
> http://stackoverflow.com/q/38514522/3691554
>
> I'm trying to parse a subset of markdown into a tree with LPeg. The idea
> is simple but I'm not sure what I'm doing. The whole spec for the thing
> I'm doing is here[1] and yes, that's a master branch github link, there
> are still some things I need to work out.
I'm not exactly sure what you want the resulting table to look like, but
going from this minimal example [1]:
#Tag
##Attribute
###Value
Content
an initial stab at the problem (untested):
local lpeg = require "lpeg"
local Carg = lpeg.Carg
local C = lpeg.C
local P = lpeg.P
local R = lpeg.R
-- ----------------------------------------------------------------
-- Match $0A or $5C $6E, given we can insert "virtual newlines" in
-- the input. This handles that case (this is the only place we
-- handle escaping, but this is a minimal example). There are
-- probably better ways to handle this, left as an exercise for the
-- reader.
-- ----------------------------------------------------------------
local nl = P"\n" + P"\\n"
-- ----------------------------------------------------------------
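-- Content. The first character is any printable character other
-- than '#'; it is followed by any number of printable characters
-- (including tabs and '#'), up to a newline or virtual newline.
-- The loop excludes nl so that it stops before a virtual newline,
-- which is itself made of printable characters.
-- ----------------------------------------------------------------
local content = R(' "',"$~") * (R("\t\t"," ~") - nl)^0 * nl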
-- ----------------------------------------------------------------
-- A "Tag" is defined as starting with a single '#' on a line,
-- followed by a name. We're being lax here in that we accept
-- everything up to a newline or virtual newline. This means that:
--
-- #A#tag is this#woot!
--
-- is considered valid. This way, we can reuse the definition of
-- content (I'm being lazy here), which already consumes the
-- trailing newline.
-- ----------------------------------------------------------------
local tag = P"#" * content
-- ----------------------------------------------------------------
-- "Attribute" is defined similarly, only with two leading '#'
-- marks. See how easy this is?
-- ----------------------------------------------------------------
local attribute = P"##" * content
-- ----------------------------------------------------------------
-- Again with the "Value".
-- ----------------------------------------------------------------
local value = P"###" * content
-- ----------------------------------------------------------------
-- Okay, entry time. I'm assuming that an entry is a tag,
-- optionally followed by one or more attribute and value pairs
-- followed by actual content lines.
--
-- I am using the first extra parameter to lpeg.match(), a table, as
-- a way to collect the results of parsing. That extra parameter is
-- used here. We first collect the tag (and the extra argument) and
-- pass that to a function to accumulate the new tag. Then we loop
-- over possible attribute/value pairs and accumulate those into the
-- table, and finally the content lines.
-- ----------------------------------------------------------------
local entry = (Carg(1) * C(tag))
            / function(t,tag)
                local x = { [0] = tag }
                table.insert(t,x)
                return t
              end
            * (
                (Carg(1) * C(attribute) * C(value))
                / function(t,a,v)
                    local x = t[#t]
                    x[a] = v
                    return t
                  end
              )^0
            * (
                (Carg(1) * C(content))
                / function(t,c)
                    local x = t[#t]
                    table.insert(x,c)
                    return t
                  end
              )^0
-- ----------------------------------------------------------------
-- A "doc" is zero or more entries.
-- ----------------------------------------------------------------
local doc = entry^0
-- ----------------------------------------------------------------
-- Here we parse some data (data is the input string). We pass in
-- an initially empty table as the first extra parameter, which is
-- used to accumulate data.
-- ----------------------------------------------------------------
local result = {}
doc:match(data,1,result)
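For anyone who wants something to paste into an interpreter, here is a
self-contained sketch of the same Carg-based approach, run against the
minimal example. It departs from the listing above in two ways, both
assumptions on my part rather than anything in the spec: the captures
hold only the text after the '#' markers (no marker, no line
terminator), and the callbacks return nothing, so match() hands back
just the end position. The names `text` and `data` are made up for the
illustration.

```lua
local lpeg = require "lpeg"
local Carg, C, P, R = lpeg.Carg, lpeg.C, lpeg.P, lpeg.R

-- Real newline, or the two-character "virtual newline" (backslash, n).
local nl   = P"\n" + P"\\n"

-- Body of a line: first character is printable and not '#'; the rest
-- is printable (or tab), stopping before any kind of newline.
local text = R(' "', "$~") * (R("\t\t", " ~") - nl)^0

-- Each line kind captures only its body; the terminator is consumed.
local tag       = P"#"   * C(text) * nl
local attribute = P"##"  * C(text) * nl
local value     = P"###" * C(text) * nl
local content   =          C(text) * nl

-- The callbacks mutate the table passed as the first extra argument
-- and return nothing, so they add no capture values of their own.
local entry = (Carg(1) * tag)
            / function(t, name) t[#t + 1] = { [0] = name } end
            * ((Carg(1) * attribute * value)
              / function(t, a, v) t[#t][a] = v end)^0
            * ((Carg(1) * content)
              / function(t, c) local x = t[#t]; x[#x + 1] = c end)^0

local doc = entry^0

-- Sample input (assumed): the minimal example, one item per line.
local data   = "#Tag\n##Attribute\n###Value\nContent\n"
local result = {}
local stop   = doc:match(data, 1, result)

assert(stop == #data + 1)               -- the whole input was consumed
assert(result[1][0] == "Tag")
assert(result[1].Attribute == "Value")
assert(result[1][1] == "Content")
```

Comparing the returned position against #data + 1 is a cheap way to
notice that the parser gave up partway through the input.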
I opted to store the "tag" as the [0]th element because that's what LuaXML
does when parsing XML documents. This should get you going, though. Other
things are left as an exercise: what to do when a tag is missing, adding
escape sequences, that odd 'raw' mode (which I didn't understand), and
parsing nested data.
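As a hint for the first of those exercises, here is one possible way to
cope with a missing tag. With the listing above, attribute or content
lines that appear before any tag would index a nil entry. The sketch
below instead files them under an untagged entry whose [0] is false.
That policy, the helper `current`, and the sample input are all my own
assumptions; the spec may well want an error instead.

```lua
local lpeg = require "lpeg"
local Carg, C, P, R = lpeg.Carg, lpeg.C, lpeg.P, lpeg.R

local nl   = P"\n" + P"\\n"
local text = R(' "', "$~") * (R("\t\t", " ~") - nl)^0

-- Hypothetical helper: return the entry currently being filled,
-- creating an untagged one if the document did not start with a tag.
local function current(t)
  if #t == 0 then t[1] = { [0] = false } end
  return t[#t]
end

local tag     = (Carg(1) * P"#" * C(text) * nl)
              / function(t, name) t[#t + 1] = { [0] = name } end
local pair    = (Carg(1) * P"##" * C(text) * nl * P"###" * C(text) * nl)
              / function(t, a, v) current(t)[a] = v end
local content = (Carg(1) * C(text) * nl)
              / function(t, c) local x = current(t); x[#x + 1] = c end

-- An entry body may now also appear before the first tag.
local body  = pair^0 * content^0
local entry = tag * body
local doc   = body * entry^0

-- Sample input (assumed): one content line before any tag.
local data   = "Orphan\n#Tag\n##A\n###V\nBody\n"
local result = {}
local stop   = doc:match(data, 1, result)

assert(stop == #data + 1)
assert(result[1][0] == false and result[1][1] == "Orphan")
assert(result[2][0] == "Tag" and result[2].A == "V")
assert(result[2][1] == "Body")
```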
-spc
[1] And I'm wondering why you even want this, when you could just use
Lua directly, or JSON, or YAML, or *any number of existing
half-documented markup languages masquerading as a "standard"* but
I'll take you at face value and not ask WTF?