Re: HTML Parser Recommendation

lua-l archive

Subject: Re: HTML Parser Recommendation
From: William Ahern <william@...>
Date: Fri, 24 May 2013 18:19:17 -0700

On Sat, May 25, 2013 at 02:30:13AM +0200, Pierre-Yves Gérardy wrote:
> On Sat, May 25, 2013 at 2:08 AM, William Ahern
> <william@25thandclement.com> wrote:
> > I'm confident that it can't
> > be done with LPeg, not if you want to be fully standards compliant and
> > handle pathological cases, such as spammers might abuse.
> 
> Even if you use match-time captures? If I understand properly, they
> make LPeg Turing complete.
> 

It might be theoretically feasible, but I'll believe it when I see it. It's
far easier to use Lua to write a traditional lexer and parser for HTML5. The
specification literally describes a straightforward state machine (multiple
machines, actually), and the easiest way to ensure you're compliant is to
write your code similarly.
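
To give a rough idea of the shape I mean, here is a minimal sketch in plain
Lua, with one function per state, each consuming one character and returning
the name of the next state. The state names only loosely follow the spec, and
the function name tokenize and the token table layout are made up for
illustration. It handles only text, <name> and </name>; attributes, comments,
entities and the spec's error recovery are all left out.

```lua
-- Simplified, spec-style tokenizer sketch (not a compliant implementation).
-- Each state function takes one character and returns the next state name.
local function tokenize(input)
  local tokens, state = {}, "data"
  local text, name, is_end = {}, {}, false

  -- Emit any pending character data as a single text token.
  local function flush_text()
    if #text > 0 then
      tokens[#tokens + 1] = { type = "text", data = table.concat(text) }
      text = {}
    end
  end

  -- Emit the start or end tag collected so far (tag names are lowercased).
  local function emit_tag()
    tokens[#tokens + 1] = {
      type = is_end and "end" or "start",
      name = table.concat(name):lower(),
    }
    name, is_end = {}, false
  end

  local states = {}

  function states.data(c)
    if c == "<" then return "tag_open" end
    text[#text + 1] = c
    return "data"
  end

  function states.tag_open(c)
    if c == "/" then return "end_tag_open" end
    if c:match("%a") then
      flush_text()
      name[#name + 1] = c
      return "tag_name"
    end
    -- Not a tag after all: keep "<" as text and reprocess c as data.
    text[#text + 1] = "<"
    return states.data(c)
  end

  function states.end_tag_open(c)
    if c:match("%a") then
      flush_text()
      is_end = true
      name[#name + 1] = c
      return "tag_name"
    end
    -- Simplification: the spec does something else here; we keep "</" as text.
    text[#text + 1] = "</"
    return states.data(c)
  end

  function states.tag_name(c)
    if c == ">" then
      emit_tag()
      return "data"
    end
    name[#name + 1] = c
    return "tag_name"
  end

  for i = 1, #input do
    state = states[state](input:sub(i, i))
  end

  -- End of input: a dangling "<" or "</" becomes text; an unterminated
  -- tag name is silently dropped in this sketch.
  if state == "tag_open" then
    text[#text + 1] = "<"
  elseif state == "end_tag_open" then
    text[#text + 1] = "</"
  end
  flush_text()
  return tokens
end
```

Calling tokenize("a<b>x</B>") yields four tokens: text "a", start tag "b",
text "x", end tag "b". Adding a state is just adding another function to the
table, which is why this style is easy to check against the spec's prose.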

The complexity goes far beyond handling tag closure. HTML5 also specifies
how to place tags that are syntactically correct but wrongly nested. Imagine
mis-nested tags that are themselves nested inside other mis-nested tags; a
pure LPeg parser would be a nightmare of dynamic matching and capture
manipulation, and so not worth the effort.
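
As a toy illustration of why this wants ordinary code and mutable state, here
is a sketch that repairs mis-nested tags using a stack of open elements. To be
clear, this is NOT the spec's algorithm (the real one is far more involved and
treats different classes of element differently). It is a simplified
close-and-reopen scheme that is only plausible for inline formatting tags like
b and i. The function name renest and the token layout are hypothetical.

```lua
-- Toy repair of mis-nested tags: close the intervening elements, close the
-- requested one, then reopen the intervening ones. Not the HTML5 algorithm.
local function renest(tokens)
  local out, open = {}, {}  -- output pieces, stack of open tag names

  for _, tok in ipairs(tokens) do
    if tok.type == "text" then
      out[#out + 1] = tok.data
    elseif tok.type == "start" then
      open[#open + 1] = tok.name
      out[#out + 1] = "<" .. tok.name .. ">"
    else
      -- End tag: find the nearest open element with the same name.
      local pos
      for i = #open, 1, -1 do
        if open[i] == tok.name then
          pos = i
          break
        end
      end
      if pos then
        local reopen = {}
        -- Close everything opened after the match, innermost first.
        for i = #open, pos + 1, -1 do
          out[#out + 1] = "</" .. open[i] .. ">"
          table.insert(reopen, 1, open[i])
          open[i] = nil
        end
        out[#out + 1] = "</" .. tok.name .. ">"
        open[pos] = nil
        -- Reopen the elements that were closed early, in original order.
        for _, name in ipairs(reopen) do
          open[#open + 1] = name
          out[#out + 1] = "<" .. name .. ">"
        end
      end
      -- An end tag with no matching open element is ignored in this sketch.
    end
  end

  -- Close anything still open at end of input.
  for i = #open, 1, -1 do
    out[#out + 1] = "</" .. open[i] .. ">"
  end
  return table.concat(out)
end

-- Tokens for the input <b><i>x</b>y</i>
local example = {
  { type = "start", name = "b" }, { type = "start", name = "i" },
  { type = "text", data = "x" },  { type = "end", name = "b" },
  { type = "text", data = "y" },  { type = "end", name = "i" },
}
print(renest(example))  --> <b><i>x</i></b><i>y</i>
```

Even this toy version needs a mutable stack that is searched, truncated and
rebuilt in the middle of the input, which is exactly the kind of bookkeeping
that is awkward to express as captures inside a grammar.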

In practice, LPeg might get you 80% of the way there, though. That's good
enough for most people, especially because most sites will rarely encounter,
or need to handle, the truly egregious cases.
