Re: LPeg indent/whitespace based grammar

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: Re: LPeg indent/whitespace based grammar
From: Duncan Cross <duncan.cross@...>
Date: Fri, 7 Jul 2017 15:38:44 +0100

On Fri, Jul 7, 2017 at 5:23 AM, Matthias Dörfelt <lists@mokafolio.de> wrote:
> While my implementation works, I feel like I am circumventing a lot of the
> actual capturing mechanisms of LPeg in order to deal with back-referencing
> to build the hierarchy. I’d like to get some feedback on weather there are
> any better ways to approach this kind of grammar using LPeg. I am new to it
> in general so I might be missing something obvious here.

Here's how I would approach it.

First, some helper patterns:

--
local initial_indent = lpeg.Cg(lpeg.P(''), 'indent')

local match_same_indent = lpeg.Cmt(
  lpeg.Cb('indent'),
  function(s, p, indent)
    if s:sub(p, p + #indent - 1) ~= current_indent then
      return false
    end
    return p + #indent
  end)

local capture_deeper_indent = lpeg.Cg(
  match_same_indent * lpeg.S' \t'^1,
  'indent')
--

initial_indent captures an empty string and stores it as the named
group "indent".

match_same_indent checks that the immediate next set of characters is
identical to the current contextual value of the named group "indent".

capture_deeper_indent requires that the current indent matches at this
position, and is immediately followed by at least one more whitespace
character. If it succeeds, the full new indent is stored as a new
named group capture for "indent".

To create the actual hierarchy document grammar, I would turn to the
"re" module, passing in the helper patterns via the "defs" table
parameter:

--
local hierarchy = re.compile([[

  hierarchy <- %initial_indent {|
    {:children: {| (<first_root> <next_root>*)? |} :}
  |} !.

  first_root <- <entry>
  next_root <- %nl <entry>

  entry <- {|
    {:name: <name> :}
    {:children: {| (<first_child> <next_child>*)? |} :}
  |}

  first_child <- %nl %capture_deeper_indent <entry>
  next_child <- %nl %match_same_indent <entry>

  name <- (!%nl .)+

]], {
  initial_indent = initial_indent;
  match_same_indent = match_same_indent;
  capture_deeper_indent = capture_deeper_indent;
})
--

To explain the elements of this grammar in English:

<hierarchy>:
  - capture the empty string as an initial base value for "indent"
  - create a table with a "children" sub-table
  - populate "children" with either no values, or <first_root>
followed by zero or more <next_root>s
  - make sure there is no more input

<first_root>: same as <entry>

<next_root>: newline character followed by <entry>

<entry>: a table containing
  - a <name> captured as field "name"
  - a sub-table "children" containing either no values, or
<first_child> followed by zero or more <next_child>ren

<first_child>:
  - newline character
  - capture a new, deeper indent than the current contextual indent
  - <entry>

<next_child>:
  - newline character
  - match the current contextual indent
  - <entry>

<name>: one or more non-newline characters

A couple of quirks to note:

1. This implementation does not enforce indentation in sets of four
spaces, but instead any arbitrary set of spaces and tabs. Hopefully it
should not be too difficult to modify capture_deeper_indent to be
strict about this, if you want.

2. The "indent" value ends up retained as a field inside "children"
tables in the final result. For example:

--
print(hierarchy:match('a\n    b').children[1].children.indent)
--

....will print four spaces. If this is an undesirable side-effect, a
function capture could be used to remove this field once the
"children" table is fully captured.

-Duncan

Follow-Ups:
- Re: LPeg indent/whitespace based grammar, Matthias Dörfelt
- Re: LPeg indent/whitespace based grammar, Matthias Dörfelt

References:
- LPeg indent/whitespace based grammar, Matthias Dörfelt
- Re: LPeg indent/whitespace based grammar, Matthias Dörfelt

Prev by Date: [OT] Re: LPeg indent/whitespace based grammar
Next by Date: Re: "String builder"-style optimization
Previous by thread: [OT] Re: LPeg indent/whitespace based grammar
Next by thread: Re: LPeg indent/whitespace based grammar
Index(es):
- Date
- Thread