- Subject: Re: HTML Parser Recommendation
- From: William Ahern <william@...>
- Date: Fri, 24 May 2013 17:08:38 -0700
On Fri, May 24, 2013 at 11:08:30AM +0200, steve donovan wrote:
> On Fri, May 24, 2013 at 10:42 AM, Rob Kendrick <rjek@rjek.com> wrote:
>
> > One of the strengths of libhubbub is that it parses good and bad HTML.
> > It follows the HTML5 spec, which essentially specifies how MSIE parses
> > broken HTML.
> >
>
> It must have lots of if statements ;)
>
> Which is precisely why it sounds like a good industrial solution; I don't
> doubt it can be done well with LPeg but there would need to be hundreds of
> tests before it would be a contender.
I just finished writing a complete tokenizer in C as an almost direct
transliteration of the HTML5 tokenizing rules. I'm confident that it can't
be done with LPeg, not if you want to be fully standards compliant and
handle the pathological cases that spammers like to abuse.
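To give a flavor of what that transliteration looks like, here is a
minimal sketch (not my actual code; the state names follow the spec,
and every helper is a hypothetical stand-in):

#include <ctype.h>
#include <stdio.h>

/* One enumerator and one switch case per spec state; the real machine
 * has dozens of states. */
enum state { S_DATA, S_TAG_OPEN, S_TAG_NAME /* ... */ };

struct tokenizer { enum state state; };

static void step(struct tokenizer *, int);

static void emit_char(struct tokenizer *T, int ch) {
    (void)T;
    printf("CHAR '%c'\n", ch);  /* stand-in for real token emission */
}

static void reconsume(struct tokenizer *T, int ch) {
    step(T, ch);                /* the spec's "reconsume" rule */
}

static void step(struct tokenizer *T, int ch) {
    switch (T->state) {
    case S_DATA:                /* Data state */
        if (ch == '<') T->state = S_TAG_OPEN;
        else emit_char(T, ch);  /* '&' handling elided */
        break;
    case S_TAG_OPEN:            /* Tag open state */
        if (isalpha((unsigned char)ch)) T->state = S_TAG_NAME;
        else { emit_char(T, '<'); T->state = S_DATA; reconsume(T, ch); }
        break;
    case S_TAG_NAME:            /* Tag name state */
        if (ch == '>') T->state = S_DATA; /* tag token emission elided */
        break;
    }
}

int main(void) {
    struct tokenizer T = { S_DATA };
    const char *s = "x<i>y";
    while (*s) step(&T, *s++);
    return 0;
}

Every quirk in the spec becomes another branch in another case, which
is why a grammar-centric tool like LPeg is such an awkward fit.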
I was originally going to use Hubbub, but I needed usable APIs below the
tree level, so that I could passively scan for URLs within a fixed memory
space, regardless of document size. The scanning of tokens--actually
characters expanded into a tagged alphabet, not structured tokens per
se--uses a separate state machine to pick out URLs and keywords.
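Roughly this shape, to illustrate (all of the names here are made up;
none of this is my library's API): the second machine is fed one
character at a time and collects candidate URLs into a fixed-size
buffer, so memory use never grows with the document.

#include <stdio.h>
#include <string.h>

struct urlscan {
    size_t matched;   /* how much of the scheme prefix we've seen */
    size_t len;       /* bytes collected into url[] */
    char url[256];    /* fixed memory; overlong URLs get truncated */
};

static void urlscan_push(struct urlscan *S, int ch) {
    static const char prefix[] = "http://";

    if (S->matched < sizeof prefix - 1) {  /* still matching the scheme */
        S->matched = (ch == prefix[S->matched])? S->matched + 1 : (ch == 'h');
        if (S->matched == sizeof prefix - 1) {
            memcpy(S->url, prefix, S->matched);
            S->len = S->matched;
        }
    } else if (ch > ' ' && ch != '"' && ch != '<') { /* crude URL charset */
        if (S->len < sizeof S->url - 1)
            S->url[S->len++] = (char)ch;
    } else {          /* delimiter: report the URL and reset */
        S->url[S->len] = '\0';
        printf("URL: %s\n", S->url);
        S->matched = S->len = 0;
    }
}

The real scanner also has to understand the tagged alphabet, so that
markup doesn't split a URL, and it picks out keywords the same way; but
the fixed-memory property falls straight out of this structure.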
I don't have to deal with JavaScript just yet because I'm scanning HTML
e-mail, but with JavaScript it's provably impossible to use any generalized
tool, because tokenization is tied into the runtime model. Global
document.write() statements actually generate tokens _inline_. For example,
<script>
document.write("&");
</script>amp;
actually generates the entity &amp;, which must be properly tokenized
prior to the tree-building phase. It's sick... sick... sick....
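Concretely, here is what that does to the input layer, sketched with
made-up structs (the "insertion point" is the spec's own concept;
everything else is illustrative): the tokenizer can't read from a flat
buffer, because document.write() splices text ahead of the current
parse position.

/* The unread input is a chain of segments so written text can be
 * inserted in front of it. */
struct segment {
    const char *p, *pe;  /* current position and end of this chunk */
    struct segment *next;
};

struct input {
    struct segment *head;  /* next segment the tokenizer will read */
};

/* document.write("..."): splice the written text in before whatever
 * input remains unread, i.e. at the insertion point. */
static void input_write(struct input *in, struct segment *written) {
    written->next = in->head;
    in->head = written;
}

static int input_next(struct input *in) {
    while (in->head && in->head->p == in->head->pe)
        in->head = in->head->next;  /* discard exhausted segments */
    return in->head? *in->head->p++ : -1;  /* -1: end of input */
}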