- Subject: Re: HTML Parser Recommendation
- From: William Ahern <william@...>
- Date: Fri, 24 May 2013 18:30:59 -0700
On Fri, May 24, 2013 at 06:15:55PM -0700, Wesley Smith wrote:
> > I just finished writing a complete tokenizer in C as an almost direct
> > transliteration of the HTML5 tokenizing rules. I'm confident that it can't
> > be done with LPeg, not if you want to be fully standards compliant and
> > handle pathological cases, such as spammers might abuse.
>
> I'd be surprised if this was the case. Do you have a particular
> example in mind?
Just read the specification:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html
Even excluding JavaScript, many of the state transitions and mid-parsing
node fixups are sufficiently complex that the burden should be on the person
claiming it can be done to show that it can and, more importantly, how. I'm
content with the conjecture that it can't be done in practice using pure LPeg.
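
To give a flavor of what those state transitions look like, here is a rough
sketch of four of the tokenizer states (data, tag open, end tag open, tag
name) as a C state machine. This is not the tokenizer mentioned above; the
names tokenize and struct counts are made up for illustration, and most of
the spec's branches (comments, attributes, character references, parse-error
reporting) are left out. It only counts what would be emitted.

```c
#include <stddef.h>

/* Four of the many tokenizer states named in the HTML5 specification. */
enum state { DATA, TAG_OPEN, END_TAG_OPEN, TAG_NAME };

/* Hypothetical summary of emitted tokens; not a structure from the spec. */
struct counts { int start_tags, end_tags, chars; };

static int is_ascii_alpha(unsigned char ch)
{
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
}

struct counts tokenize(const char *s)
{
    struct counts c = {0, 0, 0};
    enum state st = DATA;
    int is_end = 0; /* is the tag being read an end tag? */

    for (; *s; s++) {
        unsigned char ch = (unsigned char)*s;

        switch (st) {
        case DATA:
            if (ch == '<')
                st = TAG_OPEN;
            else
                c.chars++; /* emit a character token */
            break;
        case TAG_OPEN:
            if (ch == '/') {
                st = END_TAG_OPEN;
            } else if (is_ascii_alpha(ch)) {
                is_end = 0;
                st = TAG_NAME;
            } else {
                /* Simplified "anything else" branch: emit the '<' as a
                 * character and reconsume the current input in DATA.
                 * The real rules also branch on '!' and '?' here. */
                c.chars++;
                st = DATA;
                s--; /* reconsume */
            }
            break;
        case END_TAG_OPEN:
            if (is_ascii_alpha(ch)) {
                is_end = 1;
                st = TAG_NAME;
            } else {
                st = DATA; /* simplified: spec has more cases here */
            }
            break;
        case TAG_NAME:
            /* Simplified: everything up to '>' is treated as the tag,
             * so attribute values containing '>' are mishandled. */
            if (ch == '>') {
                if (is_end)
                    c.end_tags++;
                else
                    c.start_tags++;
                st = DATA;
            }
            break;
        }
    }

    /* End of input right after '<': the '<' is emitted as a character. */
    if (st == TAG_OPEN)
        c.chars++;

    return c;
}
```

Even this toy version needs a "reconsume" step, where the same input
character is handed to a different state. The tree-construction stage adds
node fixups on top of this, which the sketch does not attempt.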