- Subject: Re: Web crawling in Lua
- From: HyperHacker <hyperhacker@...>
- Date: Sun, 7 Aug 2011 10:06:50 -0600
On Sun, Aug 7, 2011 at 09:55, David Hollander <email@example.com> wrote:
> Cool! My solution was just to simplify Roberto's Lua function by
> putting all generated nodes on a single stack and deferring the nesting
> of nodes as children until a </close> tag appears, then iterating down
> from the top of the stack to find the nearest matching open tag. So
> <img class=world src = "hello">
> <span id='stuff''>
> <input type=checkbox checked>
> ..would be valid and put all elements as children of the first <div>.
> The debate then is if the inner <div> and <input> should instead be
> children of the unfinished <span>. My current interpretation is that
> they should not, though I'm not sure which error is more common. Would
> need to find more poorly programmed websites, googling for *.aspx
> might do the trick ;)
> On Sun, Aug 7, 2011 at 9:13 AM, Michal Kottman <firstname.lastname@example.org> wrote:
>> On Sunday, 7 August 2011, David Hollander <email@example.com> wrote:
>>>> I use them both in my little web-crawling utility module WDM 
>>> I see you are using Roberto's XML parser as a base, which is a strict
>>> parser that raises errors on improperly formatted XML?
>>> A problem I ran into last week is that the HTML spec is a bit
>>> different than XML, unless the webpage is specifically using an
>>> XHTML doctype, and many websites had html errors on top of that.
>> To deal with that issue, you can optionally use the html-tidy binding
>> through the toTidy() function. It returns the same table format as toXml(),
>> and also tries to clean up the source through html-tidy beforehand. The
>> source is at https://github.com/mkottman/tidy/tree/mk in the 'mk' branch.
>> WDM stores saved pages locally in a cache directory, so you can experiment
>> without downloading things multiple times. These can be compressed if the
>> bz2 library is available. You can find it at
>> https://github.com/mkottman/lua-bz2 .
What would it do if it never found a close tag? Say: <html><body>Hello world!
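
For reference, here is a rough Lua sketch of how I read the single-stack
approach described above. It is not David's or Roberto's actual code: the
function name parse, the node fields tag, attrs and open, and the tag
pattern are all assumptions made for illustration, and it ignores comments,
scripts and quoted '>' characters.

```lua
-- Hypothetical sketch of a lenient single-stack HTML parser.
-- Every node goes onto one flat stack; nesting is deferred until a
-- close tag shows up.
local function parse(html)
  local stack = {}
  local pos = 1
  while true do
    -- Find the next tag: optional "/", a name, then everything up to ">".
    local s, e, close, name, attrs = html:find("<(/?)(%w+)([^>]*)>", pos)
    if not s then break end

    -- Keep any non-blank text between the previous tag and this one.
    local text = html:sub(pos, s - 1)
    if text:find("%S") then stack[#stack + 1] = text end

    name = name:lower()
    if close == "" then
      -- Open tag: push it and mark it as still unclosed.
      stack[#stack + 1] = { tag = name, attrs = attrs, open = true }
    else
      -- Close tag: walk down from the top of the stack to the nearest
      -- unclosed node with the same name.
      local i = #stack
      while i >= 1 do
        local n = stack[i]
        if type(n) == "table" and n.open and n.tag == name then break end
        i = i - 1
      end
      if i >= 1 then
        -- Everything above the match becomes its children, in order.
        local node = stack[i]
        node.open = nil
        for j = i + 1, #stack do node[#node + 1] = stack[j] end
        for j = #stack, i + 1, -1 do stack[j] = nil end
      end
      -- A close tag with no matching open tag is silently dropped.
    end
    pos = e + 1
  end

  -- Trailing text after the last tag.
  local tail = html:sub(pos)
  if tail:find("%S") then stack[#stack + 1] = tail end
  return stack
end
```

In this sketch, at least, parse("<html><body>Hello world!") never nests
anything: it returns three flat siblings (the html node, the body node and
the text), with both nodes still flagged open. Whether the real
implementation leaves them like that or closes them at end of input is
exactly what I'm asking.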
Sent from my toaster.