On Wednesday, October 23, 2013, Sean Conner wrote:

  Okay, I may not fully understand back captures in LPeg.  Here's the
problem:  I'm attempting to parse NAPTR DNS records.  Once I obtain a given
record, I have a string in the form of: [1]

!^.*$!pstndata:cnam/+15714344048;;charset=us-ascii;ds=local;score=98,gn=CRYSTA;sn=SPERBER!

(this is the regexp portion of the NAPTR record).  RFC-3402 gives the
following grammar for this field:

        subst-expr   = delim-char  ere  delim-char  repl  delim-char  *flags
        delim-char   = "/" / "!" / <Any octet not in 'POS-DIGIT' or 'flags'>
                           ; All occurrences of a delim_char in a subst_expr
                           ; must be the same character.>
        ere          = <POSIX Extended Regular Expression>
        repl         = *(string / backref)
        string       = *(anychar / escapeddelim)
        anychar      = <any character other than delim-char>
        escapeddelim = "\" delim-char
        backref      = "\" POS-DIGIT
        flags        = "i"
        POS-DIGIT    = "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

  I have this translated into LPeg:

        DIGIT        = R"09"
        delim_char   = P"!" -- Cb("delim")
        flags        = P"i"
        backref      = P"\\" * DIGIT
        escapeddelim = P"\\" * delim_char
        anychar      = P(1) - delim_char
        string       = (escapeddelim + anychar)^1
        repl         = C((string + backref)^0)
        ere          = C((P(1) - delim_char)^0)
        idelim_char  = Cg(P"/" + P"!" + (P(1) - (DIGIT + flags)),"delim")
        regexp       = Ct(
                            idelim_char
                            * Cg(ere,"re")
                            * delim_char
                            * Cg(repl,"replace")
                            * delim_char
                            * Cg(flags^0,"flags")
                         )

and it works, except for the hardcoded delimiter.  If I leave delim_char as
is, I get the expected data:

        regexp =
        {
          re = "^.*$",
          flags = "",
          replace = "pstndata:cnam/+15714344048;;charset=us-ascii;ds=local;score=98,gn=CRYSTA;sn=SPERBER",
          delim = "!",
        }

But if I try to use a backreference (delim_char = Cb("delim")), it doesn't work:

        regexp =
        {
          [1] = "!",
          [2] = "!",
          replace = "",
          flags = "",
          re = "",
          delim = "!",
        }

  I'm wondering, am I using back references correctly?  The example in the
LPeg documentation [2] is close to what I want, but I'm missing something.
I know I can use Lua's built-in string patterns to break this string
apart, but I'd rather use LPeg, if only to figure out how to parse this type
of data.
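
  For comparison, the pattern-based fallback could look like the sketch
below.  It relies on the %1 back reference that Lua patterns support;
split_subst is a hypothetical helper name, and the sketch neither validates
the delimiter nor treats escaped delimiters inside the ERE specially.

```lua
-- Hypothetical helper, not part of the original grammar.
local function split_subst(s)
  -- (.)   the delimiter: whatever the first character is
  -- (.-)  the ERE: shortest run up to the next copy of the delimiter (%1)
  -- (.-)  the replacement: up to the delimiter that precedes the flags
  -- (i*)  zero or more "i" flags, then end of string
  return s:match("^(.)(.-)%1(.-)%1(i*)$")
end

local delim, re, repl, flags = split_subst("!^.*$!pstndata:cnam/+15714344048!")
print(delim, re, repl, flags)
```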

  -spc (Who's really puzzled by this)

[1]     All characters appearing in this work are fictitious. Any
        resemblance to real persons, living or dead, is purely coincidental.

[2]     http://www.inf.puc-rio.br/~roberto/lpeg/


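 Cb takes only the group name, and it matches the empty string: it produces the captured value but does not consume the delimiter from the input, which is why re and replace come back empty. To require the same character again, wrap the back capture in a match-time capture (Cmt) that consumes one character and compares it with the saved one:

Cmt(C(1) * Cb("delim"), function(s, i, c, d) return c == d end)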

This is untested, but that is the gist of it. 
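
A fuller sketch of the grammar from the original message, assuming LPeg's documented behavior that Cb(name) takes only a group name and matches the empty string. Here delim_char is built with a match-time capture (Cmt) so that it consumes one character and accepts it only when it equals the saved delimiter. The rule named string in the original is renamed str (my choice, to avoid shadowing Lua's string library); everything else follows the posted translation.

```lua
local lpeg = require "lpeg"
local P, R, C, Cb, Cg, Cmt, Ct =
  lpeg.P, lpeg.R, lpeg.C, lpeg.Cb, lpeg.Cg, lpeg.Cmt, lpeg.Ct

local DIGIT = R"09"
local flags = P"i"

-- First delimiter: any allowed character, saved in the named group "delim".
local idelim_char = Cg(P"/" + P"!" + (P(1) - (DIGIT + flags)), "delim")

-- Later delimiters: consume one character, then succeed only if it equals
-- the saved one. Cb("delim") consumes nothing; it only hands the saved
-- value to the function.
local delim_char = Cmt(C(1) * Cb("delim"), function(_, _, ch, delim)
  return ch == delim
end)

local backref      = P"\\" * DIGIT
local escapeddelim = P"\\" * delim_char
local anychar      = P(1) - delim_char
local str          = (escapeddelim + anychar)^1  -- "string" in the original
local repl         = C((str + backref)^0)
local ere          = C((P(1) - delim_char)^0)

local regexp = Ct(
    idelim_char
  * Cg(ere, "re")
  * delim_char
  * Cg(repl, "replace")
  * delim_char
  * Cg(flags^0, "flags")
)

local t = regexp:match("!^.*$!pstndata:cnam/+15714344048;;charset=us-ascii;ds=local;score=98,gn=CRYSTA;sn=SPERBER!")
print(t.delim, t.re, t.replace, t.flags)
```

With this change the same grammar should also accept other delimiters, for example "/a(b)/x\1/i", and reject input whose closing delimiter differs from the opening one.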

-Andrew