[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

• Subject: Re: LPEG for dummies
• From: Pierre-Yves Gérardy <pygy79@...>
• Date: Thu, 23 May 2013 04:08:50 +0200

On Tue, May 7, 2013 at 5:39 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 2013/5/1 Dirk Laurie <dirk.laurie@gmail.com>:
>
>> The current version, incorporating all those points, is available at
>>
>>   https://github.com/dlaurie/lua-notes
>
> 1. I have added a case study with about 60 lines of real tested code
>     that implements a toy-scale APL-to-Lua compiler.
> 2. All the LPEG that I actually used, and some more, has been
>     included in the primer. That's almost all of it except Cg, Cb and
>     re.lua.

Here's the groups tutorial:

We'll first get the easy one out of the way: named groups inside table captures.

Ct(Cc"foo" * Cg(Cc"bar" * Cc"baz", "label")* Cc"qux"):match""
--> { "foo", "qux", label = "bar" }

In a table capture, the value of the first capture inside the group
("bar") is assigned to the corresponding key ("label") in the table.
As you can see, C"baz" got lost in the process. the key must be a
string (or a number that will be automatically tostring()ed).

Note that the group must be an immediate child of the table:

Ct(C(Cg(1,"foo"))):match"a"
--> {"a"}

--------

For the rest of this tutorial, we must first explore a subtelty in the
way captures handle their subcaptures.

Some captures operate on the values produced by their subcaptures,
while others operate on the capture objects. This is sometimes
counter-intuitive.

Let's take the following pattern:

(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> "bc", "b", "c"

As you can see, it inserts three values in the capture stream.

Let's wrap it in a table capture:

Ct(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> { "bc", "b", "c" }

Ct operates on values. In the last example, the three values that are
inserted in order in the table.

Now, let's try a substitution capture:

Cs(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> "abcd"

Cs operates on captures. It scans the first level of its nested
captures, and only takes the first value of each one. In the above
example, "b" and "c" are thus discarded. Here's another example that
may make things more clear:

function the_func (bcd)
assert(bcd == "bcd")
return "B", "C", "D"
end

Ct(1 * ( C"bcd" / the_func ) * 1):match"abcde"
--> {"B", "C", "D"}  -- All values are inserted.

Cs(1 * ( C"bcd" / the_func ) * 1):match"abcde"
--> "aBe"   -- the "C" and "D" have been discarded.

A more detailed account of the by value / by capture behaviour of each
kind of capture will be the topic of another post.

--------

Another important thing to realise is that most captures shadow their
subcaptures, but some don't. As you can see in the last example, the
value of C"bcd" is passed to the /function capture, but it doesn't end
in the final capture list. Ct and Cs are also opaque in this regard.
They only produce, resectively, one table or one string.

On the other hand, C() is transparent. As we've seen above, the
subcaptures of C() are also inserted in the stream.

C(C"b" * C"c"):match"bc" --> "bc", "b", "c"

The only transparent captures are C() and the anonymous Cg(). The (
named Cg() / Cb() ) couple has a similar behaviour, but the values are
teleported, and the values captured in the named Cg() ends up inserted
in the stream at the place of the Cb() (more on this later).

Here are a few examples for the anonymous groups:

(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> "b", "c", "d"

Ct(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> { "b", "c", "d" }

Cs(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> "abe" -- "c" and "d" are dropped.

This may seem baffling... But there's one case where this behavior is useful:
in folding captures. Let's write a very basic calculator, that adds or
substracts one digit numbers.

function calc(a, op, b)
a, b = tonumber(a), tonumber(b)
if op == "+" then
return a + b
else
return a - b
end
end

Cf(
C(R"09") *
Cg( C(S"+-") * C(R"09") )^0
, calc
):match"1+2-3+4"

The capture tree will look like this [*]:

{"Cf", func = calc, children = {
{"C", val = "1"},
{"Cg", children = {
{"C", val = "+"},
{"C", val = "2"}}},
{"Cg", children = {
{"C", val = "-"},
{"C", val = "3"}}},
{"Cg", children = {
{"C", val = "+"},
{"C", val = "4"}}}}}

You probably see where this is going... Like Cs, Cf operates on
capture objects. It will first extract the first value of the first
capture, and use it as the initial value. If there are no more
captures, this value becomes the value of the Cf().

But we have more captures! In our case, it will pass all the values of
the second capture (the group) to calc(), tacked after the value of
the first one. Here's the evaluation of the above Cf()

first_arg = "1"
next_ones: "+", "2"
first_arg = calc("1", "+", "2") -- 3, calc() returns numbers

next_ones: "-", "3"
first_arg = calc(3, "-", "3")

next_ones: "+", "4"
first_arg = calc(0, "+", "4")

return first_arg -- Tadaaaa.

[*] Actually, at match time, the capture objects only store their
bounds and auxiliary data (like calc() for the Cf). The actual values
are produced sequencially after the match has completed, but, it makes
things more clear as displayed above. In the above example, the values
of the nested C() and Cg(C(),C()) are actually produced one at a time,
at each corresponding cycle of the folding process.

--------

Now let's do a named Cg:

( 1 * Cg(C"b" * C"c", "l@bel") * C"d" * 1 * Cb"l@bel" ):match"abcde"
-- > "d", "b", "c"    -- Warp!

( 1 * Cg(C"b" * C"c" * C"d", "t@g") * C(1) * Ct(Cb"t@g") ):match"abcde"
--> "e", { "b", "c", "d" }

Cb"tag" will look back for a corresponding Cg that has succeeded. It
goes back and up in the tree, and consumes captures. In other words,
it searches its elder siblings, and the elder siblings of its parents,
but not the parents themselves. Neither does it test the children of
the siblings/siblings of ancestors.

It proceeds as follows (start from the [ #### ] and follow the numbers back up).

The numbered captures are the captures that are tested. The ones marked with
[ ** ] are not, for the various reasons listed. This is hairy, but
AFAICT complete.

Cg(-- [ ** ] ... This one would have been seen,
-- if the search hadn't sopped at *the one*.
"Too late, mate."
, "~@~"
)

* Cg( -- [ 3 ] The search ends here. <--------------[[ Stop ]]
"This is *the one*!"
, "~@~"
)

* Cg(--  [ ** ] ... The great grand parent.
-- Cg with the right tag, but direct ancestor,
-- thus not checked.

Cg( -- [ 2 ] ... Cg, but not the right tag. Skipped.
Cg( -- [ ** ] good tag but masked by the parent (whatever its type)
"Masked"
, "~@~"
)
, "BADTAG"
)

* C( -- [ ** ] ... grand parent. Not even checked.

(
Cg( -- [ ** ] ... Failure after the fact. the group
-- is removed from the capture tree, and will not
-- be seen.
"full of FAIL"
, "~@~"
)
* false
)

+ Cmt(  -- [ ** ] ... Direct parent. Skip.
C(1) -- [ 1 ] ... Not a Cg

* Cb"~@~"   -- [ #### ]  <----------------- [[ START HERE ]] --
, function(subject, index, cap1, cap2)
return assert(cap2 == "This is *the one*!")
end
)
)
, "~@~" -- [ ** ] goes with the great grand parent.
)

------------------

That was a big one. Off to bed now, it's 4 am. I hpe there aren't too
many typos...

-- Pierre-Yves