lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On Tue, Sep 18, 2018 at 07:40:05PM +0600, joy mondal wrote:
> Hi Sean !
> 
> After reading my mail I think I wasn't clear enough.
> 
> How do you deal with situations where you have a matching end character in
> between your match ?
> 
> It particularly problematic for characters such as  " '
> 
> Is there a LPEG general best practice for such cases ?

How would you do it if you wrote a parser by hand? You'd change the set of
characters that you matched when inside a quoted strong, either by (a) using
a new map of allowed characters to consume or (b) calling a new function
which accepted a narrower class of characters.

You do the same thing, logically, with a PEG.

  local blank = P"\n\r \v\t"
  local special = P"\\" + P'"' + blank
  local escaped = P"\\" * C(P(1)) -- explicit capture to elide blackslash
  local literal = (P(1) - special) + escaped
  local simplestring = C(literal^1)
  local quotedstring = P'"' * C((literal + (special - P'"')))^0) * P'"'
  local string = (simplestring + quotedstring)^1

In the above, simplestring captures anything that isn't special (quote,
blackslash or blank). A quotedstring captures a different class of
characters--a superset of simplestring that is everything except an
unescaped quote. (There are many ways to reformulate the above, which
happens to be untested, FWIW.)

Notice the use of predicates: the minus operator creates an expression that
matches the first operand iff the second operand does NOT match, effectively
creating a new character class by subtraction.

And notice the use of ordered choice: PEGs are evaulated left-to-right. The
first matching choice short-circuits the subsequent choices. quotedstring
wouldn't work if we swapped it like so

  P'"' * C(((special - P'"') + literal))^0) * P'"'

because backslashes would be consumed by `special - P'"'` instead of being
captured and transformed by `literal`. Ordered choice is often a convenience
that allows us to be more concise (we don't need to explicitly exclude
blackslash-escaped sequences from the second choice), but sometimes
necessary. (Perhaps there's some deep equivalency between ordered choice and
predicates?)

Predicates and ordered choice are integral to PEGs. It's why they work and
why subexpressions can be composed consistently, permitting LPeg's powerful
capturing features. Extensions to regular expression engines support
zero-width assertions, which are similar to predicates. But ordered choice
is the opposite behavior of regular expressions, which will either (a)
backtrack and choose the "best fit" subexpression or (b) compose in
unpredictable ways when subexpressions match overlapping segments.

Ordered choice also reflects how greediness is subtly different between PEGs
and regular expressions--for regular expressions greediness is a property of
the entire match, whereas for PEGs greediness is a property of each
individual subexpression. If a subexpression's greediness would cause the
entire match to fail, the PEG will simply fail, whereas a regular expression
will behave as-if a subexpression is only as greedy as possible without
preventing a successful match. This effort to make greediness "best fit" is
why when you join two subexpressions in a regular expression you can get
surprising results that seemingly change the matching behavior of a
subexpression.

Once you tweak regular expressions to become as powerful as PEGs, you've
simply reinvented PEGs, and in particular necessarily implemented predicates
and ordered choice.