> On 22/10/2009 06:03, Fernando P. García wrote:
>> Based on your experience, can you tell me why LPEG is better than PCRE?
>
> It might be arguable for simple patterns like checking a date format 
> (without checking the validity of the date!) or checking that a string has 
> no spaces, for example.
> But for real parsing (using context and such), a parser (PEG-based or 
> something else) is always better than an RE engine... :-)

Even for regular languages, LPEG may be better than PCRE. As an extreme
example, we have email addresses as defined in RFC 822. There is a real
Perl module that validates such addresses using a regular expression
that starts like this:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\

and goes on for a total of 75 lines like those.
(Mail::RFC822::Address: regexp-based address validation)

In LPEG we may describe the same pattern with the following syntax:

p = re.compile[[
  address <- <mailbox> / <group>
  group <- <phrase> ":" (<mailbox> (","+ <mailbox>)*)? ";"
  mailbox <- <addr_spec> / <phrase> <route_addr>
  route_addr <- "<" <route>? <addr_spec> ">"
  route <- (("@" <domain>) (","+ "@" <domain>)*) ":"
  addr_spec <- <local_part> "@" <domain>
  local_part <- <word> ("." <word>)*
  domain <- <sub_domain> ("." <sub_domain>)*
  sub_domain <- <domain_ref> / <domain_literal>
  domain_ref <- <atom>
  domain_literal <- "[" ([^][] / "\" .)* "]"
  phrase <- <word> (","+ <word>)*
  word <- <atom> / <quoted_string>
  atom <- [^] %c()<>@,;:\".[]+
  quoted_string <- '"' ([^"\%nl] / "\" .)* '"'
]]
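
To try a grammar like this, something along the following lines should work. It is a minimal sketch and not the full grammar: it keeps only the addr_spec rule with atom-based words. The trailing `!.` (end of input) and the helper name `is_addr_spec` are additions for illustration, and the example assumes LPeg with its `re` module is installed.

```lua
local re = require "re"  -- the re module that ships with LPeg

-- Reduced fragment of the grammar above: addr_spec only, with atoms
-- in place of words. The final !. is an addition that forces the
-- whole subject to be consumed.
local addr = re.compile[[
  addr_spec <- <local_part> "@" <domain> !.
  local_part <- <atom> ("." <atom>)*
  domain <- <atom> ("." <atom>)*
  atom <- [^] %c()<>@,;:\".[]+
]]

-- Hypothetical helper: true when the whole string is an addr_spec.
local function is_addr_spec(s)
  return addr:match(s) ~= nil
end

print(is_addr_spec("roberto@example.org"))  --> true
print(is_addr_spec("no at sign"))           --> false
```

Without captures, `match` returns the position after the match or nil, which is why the helper only compares against nil.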

Note that the above syntax defines a regular language (no recursive
rules).
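
The "no recursive rules" claim can be checked mechanically: if the graph of which rule refers to which has no cycle, every rule can be inlined into its callers until a single non-recursive expression remains. The sketch below is plain Lua and does not need LPeg. The `deps` table is transcribed by hand from the grammar (with group taken to refer to mailbox), and the function name `has_cycle` is made up for this example.

```lua
-- Rule references, transcribed by hand from the grammar above.
local deps = {
  address        = {"mailbox", "group"},
  group          = {"phrase", "mailbox"},
  mailbox        = {"addr_spec", "phrase", "route_addr"},
  route_addr     = {"route", "addr_spec"},
  route          = {"domain"},
  addr_spec      = {"local_part", "domain"},
  local_part     = {"word"},
  domain         = {"sub_domain"},
  sub_domain     = {"domain_ref", "domain_literal"},
  domain_ref     = {"atom"},
  domain_literal = {},
  phrase         = {"word"},
  word           = {"atom", "quoted_string"},
  atom           = {},
  quoted_string  = {},
}

-- Depth-first search with three states per rule:
-- nil = unvisited, "open" = on the current path, "done" = fully explored.
local function has_cycle(graph)
  local state = {}
  local function visit(name)
    if state[name] == "open" then return true end
    if state[name] == "done" then return false end
    state[name] = "open"
    for _, ref in ipairs(graph[name] or {}) do
      if visit(ref) then return true end
    end
    state[name] = "done"
    return false
  end
  for name in pairs(graph) do
    if visit(name) then return true end
  end
  return false
end

print(has_cycle(deps))  --> false
```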

-- Roberto