- Subject: Re: LPEG vs. PCRE (was: Can someone convert this POSIX regex to a Lua regex?)
- From: Roberto Ierusalimschy <roberto@...>
- Date: Thu, 22 Oct 2009 15:07:12 -0200
> On 22/10/2009 06:03, Fernando P. García wrote:
>> Based on your experience, may you tell me why LPEG is better than PCRE?
>
> It might be arguable for simple patterns like checking a date format
> (without checking the validity of the date!) or looking that a string has
> no spaces, for example.
> But for real parsing (using context and such), a parser (Peg based or
> something else) is always better than a RE engine... :-)
Even for regular languages, LPEG may be better than PCRE. As an extreme
example, we have email addresses as defined in RFC 822. There is a real
Perl module that validates such addresses using a regular expression
that starts like this:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
and goes on for a total of 75 lines like those.
(Mail::RFC822::Address: regexp-based address validation)
In LPEG we may describe the same pattern with the following syntax:
p = re.compile[[
  address         <- <mailbox> / <group>
  group           <- <phrase> ":" (<mailbox> (","+ <mailbox>)*)? ";"
  mailbox         <- <addr_spec> / <phrase> <route_addr>
  route_addr      <- "<" <route>? <addr_spec> ">"
  route           <- (("@" <domain>) (","+ "@" <domain>)*) ":"
  addr_spec       <- <local_part> "@" <domain>
  local_part      <- <word> ("." <word>)*
  domain          <- <sub_domain> ("." <sub_domain>)*
  sub_domain      <- <domain_ref> / <domain_literal>
  domain_ref      <- <atom>
  domain_literal  <- "[" ([^][] / "\" .)* "]"
  phrase          <- <word> (","+ <word>)*
  word            <- <atom> / <quoted_string>
  atom            <- [^] %c()<>@,;:\".[]+
  quoted_string   <- '"' ([^"\%nl] / "\" .)* '"'
]]
Note that the grammar above defines a regular language: none of its
rules is recursive.
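As a rough illustration of how such a grammar is used, here is a minimal
sketch that compiles only the addr_spec part and matches a few strings.
It assumes LPeg and its bundled re module are installed. The helper name
is_addr_spec and the sample addresses are made up for this example, and
the trailing "!." (which requires the whole string to match) is an
addition that is not in the grammar above. Nonterminals are written as
bare names, the form used in current re documentation. Quoted strings,
domain literals, whitespace and comments are deliberately left out.

```lua
-- Assumption: LPeg is installed and provides the "re" module.
local re = require "re"

-- Reduced sketch: addr_spec built from atoms only.
local addr_spec = re.compile[[
  addr_spec   <- local_part "@" domain !.
  local_part  <- atom ("." atom)*
  domain      <- atom ("." atom)*
  atom        <- [^] %c()<>@,;:\".[]+
]]

-- Hypothetical helper: true if the whole string is an atom-only addr_spec.
local function is_addr_spec(s)
  return addr_spec:match(s) ~= nil
end

print(is_addr_spec("roberto@example.org"))          --> true
print(is_addr_spec("first.last@mail.example.org"))  --> true
print(is_addr_spec("no-at-sign.example.org"))       --> false
print(is_addr_spec("a@@b"))                         --> false
```

Because each rule only refers to rules defined below it, the sub-patterns
could also be substituted into one another by hand to get a single
(unreadable) expression, which is what the Perl regex amounts to.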
-- Roberto