- Subject: Re: HTML Parser Recommendation
- From: William Ahern <william@...>
- Date: Fri, 24 May 2013 18:19:17 -0700
On Sat, May 25, 2013 at 02:30:13AM +0200, Pierre-Yves Gérardy wrote:
> On Sat, May 25, 2013 at 2:08 AM, William Ahern
> <william@25thandclement.com> wrote:
> > I'm confident that it can't
> > be done with LPeg, not if you want to be fully standards compliant and
> > handle pathological cases, such as spammers might abuse.
>
> Even if you use match-time captures? If I understand properly, they
> make LPeg Turing complete.
>
It might be theoretically feasible, but I'll believe it when I see it. It's
far easier to use Lua to write a traditional lexer and parser for HTML5. The
specification literally describes a straightforward state machine (multiple
machines, actually), and the easiest way to ensure you're compliant is to
write your code similarly.
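To make the "write it like the spec" point concrete, here is a minimal sketch of a hand-written state-machine tokenizer. It is in Python only for brevity; the same shape carries over to Lua. The function name `tokenize`, the token tuples, and the `skip_to_gt` state are made up for illustration. The other state names loosely follow the HTML5 tokenizer, but this covers a tiny fraction of it: no attributes, comments, entities, or script handling, and a `>` inside a quoted attribute value will confuse it.

```python
def tokenize(html):
    """Toy HTML tokenizer written as an explicit state machine.

    Illustrative only: attributes, comments, entities, and script/RCDATA
    handling are deliberately left out.
    """
    tokens = []        # emitted ("text" | "start" | "end", value) pairs
    text = []          # character data not yet emitted
    name = []          # tag name being accumulated
    is_end = False     # True while reading an end tag
    state = "data"

    def is_ascii_alpha(ch):
        return ch.isascii() and ch.isalpha()

    def flush_text():
        if text:
            tokens.append(("text", "".join(text)))
            text.clear()

    def emit_tag():
        tokens.append(("end" if is_end else "start", "".join(name)))

    i = 0
    while i < len(html):
        ch = html[i]
        i += 1
        if state == "data":
            if ch == "<":
                state = "tag_open"
            else:
                text.append(ch)
        elif state == "tag_open":
            if ch == "/":
                state = "end_tag_open"
            elif is_ascii_alpha(ch):
                flush_text()
                is_end, name = False, [ch.lower()]
                state = "tag_name"
            else:
                # Not a tag after all: keep "<" as text, reconsume ch.
                text.append("<")
                state = "data"
                i -= 1
        elif state == "end_tag_open":
            if is_ascii_alpha(ch):
                flush_text()
                is_end, name = True, [ch.lower()]
                state = "tag_name"
            elif ch == ">":
                state = "data"          # "</>" produces nothing
            else:
                # Simplification: the spec switches to a bogus comment here.
                text.append("</")
                state = "data"
                i -= 1
        elif state == "tag_name":
            if ch == ">":
                emit_tag()
                state = "data"
            elif ch.isspace() or ch == "/":
                state = "skip_to_gt"    # hypothetical state: ignore attributes
            else:
                name.append(ch.lower())
        elif state == "skip_to_gt":
            if ch == ">":
                emit_tag()
                state = "data"

    # End of input: a dangling "<" or "</" is text; unfinished tags are dropped.
    if state == "tag_open":
        text.append("<")
    elif state == "end_tag_open":
        text.append("</")
    flush_text()
    return tokens
```

Each branch corresponds to one state and consumes one character, so checking the code against the corresponding paragraph of the specification is mostly a matter of reading the two side by side.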

The complexity goes far beyond handling tag closure. HTML5 also specifies
where to place tags that are syntactically valid but wrongly nested. Imagine
mis-nested tags that are themselves nested inside other mis-nested tags; a
pure LPeg parser for that would be a nightmare of dynamic matching and
capture manipulation, and so not worth the effort.
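As a rough illustration of the kind of repair involved, the sketch below re-nests a token stream such as `<b><i>x</b>y</i>` using a stack of open elements and a list of formatting elements that should still apply to later content. This is far simpler than what HTML5 specifies (its "adoption agency algorithm"). The names `renest` and `FORMATTING` are hypothetical, the token tuples are an assumed format, and the code assumes each formatting tag is open at most once and ignores block elements and scoping entirely.

```python
# Hypothetical, abbreviated set of formatting element names.
FORMATTING = {"b", "i", "em", "strong", "u"}

def renest(tokens):
    """Turn a mis-nested ("start"|"end"|"text", value) stream into a
    properly nested one. Simplified sketch, not the HTML5 algorithm."""
    out = []
    stack = []    # currently open elements, innermost last
    active = []   # formatting elements that should apply to later content

    def reopen():
        # Reopen formatting elements that were closed as a side effect.
        for name in active:
            if name not in stack:
                stack.append(name)
                out.append(("start", name))

    for kind, value in tokens:
        if kind == "text":
            reopen()
            out.append((kind, value))
        elif kind == "start":
            reopen()
            stack.append(value)
            out.append((kind, value))
            if value in FORMATTING:
                active.append(value)
        elif kind == "end":
            if value not in stack:
                continue              # stray end tag: ignore it
            while True:               # close everything opened inside it
                top = stack.pop()
                out.append(("end", top))
                if top == value:
                    break
            if value in active:
                active.remove(value)  # explicitly closed, so do not reopen

    while stack:                      # close whatever is still open
        out.append(("end", stack.pop()))
    return out
```

For `<b><i>x</b>y</i>` this yields the equivalent of `<b><i>x</i></b><i>y</i>`. Even this toy version needs mutable state consulted on every token, which is the part that is awkward to express as a grammar.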

In practice LPeg might get you 80% of the way there, though. That's good
enough for most people, especially because most sites will rarely encounter,
or be required to handle, all the egregious cases.