Re: [LPeg] How can I parse a subset of markdown?

lua-l archive

Subject: Re: [LPeg] How can I parse a subset of markdown?
From: Sean Conner <sean@...>
Date: Fri, 22 Jul 2016 19:13:30 -0400

It was thus said that the Great Soni L. once stated:
> http://stackoverflow.com/q/38514522/3691554
> 
> I'm trying to parse a subset of markdown into a tree with LPeg. The idea 
> is simple but I'm not sure what I'm doing. The whole spec for the thing 
> I'm doing is here[1] and yes, that's a master branch github link, there 
> are still some things I need to work out.

  I'm not exactly sure what you want the resulting table to look like, but
going from this minimal example [1]:

	#Tag
	##Attribute
	###Value
	Content

an initial stab at the problem (untested):

	local lpeg = require "lpeg"
	local Carg = lpeg.Carg
	local C    = lpeg.C
	local P    = lpeg.P
	local R    = lpeg.R

	-- ----------------------------------------------------------------
	-- Match $0A or $5C $6E, given we can insert "virtual newlines" in
	-- the input.  This handles that case (this is the only place we
	-- handle escaping, but this is a minimal example).  There are
	-- probably better ways to handle this, left as an exercise for the
	-- reader.
	-- ----------------------------------------------------------------
	
	local nl = P"\n" + P"\\n"
	
	-- ----------------------------------------------------------------
	-- Content.  The first character is any printable character other
	-- than '#'; it is followed by any number of printable characters
	-- (including tabs and '#') and ends at a newline.  LPeg repetition
	-- never backtracks, so the virtual newline has to be excluded from
	-- the character set explicitly, or the loop would swallow it.
	-- ----------------------------------------------------------------

	local content = (R(' "',"$~") - P"\\n")
	              * (R("\t\t"," ~") - P"\\n")^0
	              * nl
	
	-- ----------------------------------------------------------------
	-- A "Tag" is defined as starting with a single '#' on a line,
	-- followed by a name.  We're being lax here in that we accept
	-- everything up to a newline or virtual newline.  This means that:
	--
	--	#A#tag is this#woot!
	--
	-- is considered valid.  This way, we can reuse the definition of
	-- content (I'm being lazy here).  Since content already consumes
	-- the trailing newline, no second one is matched here.
	-- ----------------------------------------------------------------
	
	local tag = P"#" * content
	
	-- ----------------------------------------------------------------
	-- "Attribute" is defined similarly, only with two leading '#'
	-- marks.  See how easy this is?
	-- ----------------------------------------------------------------
	
	local attribute = P"##" * content
	
	-- ----------------------------------------------------------------
	-- Again with the "Value".
	-- ----------------------------------------------------------------
	
	local value = P"###" * content
	
	-- ----------------------------------------------------------------
	-- Okay, entry time.  I'm assuming that an entry is a tag,
	-- optionally followed by one or more attribute and value pairs
	-- followed by actual content lines.
	--
	-- I am using the first extra parameter to lpeg.match(), a table, as
	-- a way to collect the results of parsing.  That extra parameter is
	-- used here.  We first collect the tag (and the extra argument) and
	-- pass that to a function to accumulate the new tag.  Then we loop
	-- over possible attribute/value pairs and accumulate those into the
	-- table, and finally the content lines.
	-- ----------------------------------------------------------------
	
	local entry = (Carg(1) * C(tag))
	            / function(t,tag)
	                local x = { [0] = tag }
	                table.insert(t,x)
	                return t
	              end
	            * (
	                (Carg(1) * C(attribute) * C(value))
	                /  function(t,a,v)
	                     local x = t[#t]
	                     x[a] = v
	                     return t	                     
	                   end
	              )^0
	            * (
	                (Carg(1) * C(content))
	                /  function(t,c)
	                     local x = t[#t]
	                     table.insert(x,c)
	                     return t
	                   end
	              )^0
	              
	-- ----------------------------------------------------------------
	-- A "doc" is zero or more entries.
	-- ----------------------------------------------------------------
	
	local doc = entry^0

	-- ----------------------------------------------------------------
	-- Here we parse some data.  We pass in an initially empty table as
	-- the first extra parameter, which is used to accumulate data.
	-- ----------------------------------------------------------------
	
	local data   = "#Tag\n##Attribute\n###Value\nContent\n" -- the minimal example
	local result = {}
	doc:match(data,1,result)
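
  If the extra-argument trick is unfamiliar, here is a smaller sketch of just
that part, separate from the markdown grammar.  The names acc, word and words
are made up for illustration.  lpeg.Carg(1) produces the first extra value
handed to match(), and a function capture that returns nothing adds no
capture value of its own, so the table is the only place results end up.

```lua
local lpeg = require "lpeg"
local C, Carg, P, R = lpeg.C, lpeg.Carg, lpeg.P, lpeg.R

-- One lowercase word followed by optional spaces.  The function gets
-- the accumulator table (from Carg(1)) plus the captured word.
local word = (Carg(1) * C(R"az"^1))
           / function(acc,w)
               acc[#acc + 1] = w   -- no return value, so no capture is produced
             end
           * P" "^0

-- Zero or more words, then end of input.
local words = word^0 * -1

local acc = {}
words:match("one two three",1,acc)
-- acc now holds the three words, in order
```

  The same shape is what entry uses: each piece of the grammar grabs the
table via Carg(1) and files its capture into it.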

  I opted to store the "tag" as the [0]th element because that's what LuaXML
does when parsing XML documents.  This should get you going, though other
things are left as an exercise: what to do about a missing tag, adding in
escape sequences, that odd 'raw' mode I didn't understand, and parsing nested
data.
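
  To see what table shape this approach produces, here is a compact,
self-contained variant.  It is a sketch under a few assumptions of mine, not
the spec: the names text and parse are made up, the captures are narrowed so
only the text of each line is stored (not the '#' marks or the newline), and
the whole input must match or parse returns nil.

```lua
local lpeg = require "lpeg"
local C, Carg, P, R = lpeg.C, lpeg.Carg, lpeg.P, lpeg.R

local nl   = P"\n" + P"\\n"
-- Text of one line: the first character is not '#', and the virtual
-- newline is excluded so the repetition cannot swallow it.
local text = (R(' "',"$~") - P"\\n") * (R("\t\t"," ~") - P"\\n")^0

-- Capture only the text of each kind of line.
local tag       = P"#"   * C(text) * nl
local attribute = P"##"  * C(text) * nl
local value     = P"###" * C(text) * nl
local content   =          C(text) * nl

local entry = (Carg(1) * tag)
            / function(t,name)
                t[#t + 1] = { [0] = name }   -- new entry, tag stored at [0]
              end
            * (
                (Carg(1) * attribute * value)
                / function(t,a,v)
                    t[#t][a] = v             -- attribute/value on last entry
                  end
              )^0
            * (
                (Carg(1) * content)
                / function(t,c)
                    local x = t[#t]
                    x[#x + 1] = c            -- content lines at [1], [2], ...
                  end
              )^0

local doc = entry^0 * -1

-- Returns the accumulated table, or nil if the input does not match.
local function parse(data)
  local result = {}
  if doc:match(data,1,result) then
    return result
  end
  return nil
end

local r = parse("#Tag\n##Attribute\n###Value\nContent\n")
print(r[1][0], r[1].Attribute, r[1][1])
```

  With the minimal example, r[1] ends up as one table holding the tag at [0],
the attribute name as a string key, and the content line at [1].  Input that
starts with a content line and no tag is simply rejected here, which is one
possible answer to the missing-tag question, not necessarily the right one.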

  -spc
  
[1]	And I'm wondering why you even want this, when you could just use
	Lua directly, or JSON, or YAML, or *any number of existing
	half-documented markup languages masquerading as a "standard"* but
	I'll take you at face value and not ask WTF?
