Re: Web crawling in Lua

lua-l archive


Subject: Re: Web crawling in Lua
From: Justin Cormack <justin@...>
Date: Mon, 8 Aug 2011 08:27:54 +0100

On 7 Aug 2011, at 20:52, David Hollander <dhllndr@gmail.com> wrote:

> Hmm. If there are no close tags in the entire page, it would list all the elements at the top level of the DOM. To reconstruct the nesting I'd need a table of elements whose parent must be X (which would also work for unclosed table rows/cells, possibly a more common instance of this mistake), or do the reverse and keep a table of elements that must be empty. I didn't need such error correction at the time, but it could still be done in one pass if a rule check based on the HTML spec were added.
> 

How to parse HTML has been usefully formalised as part of the HTML5 spec, and there are now some parsers based on it. The spec covers all the edge cases and is based on tests of lots of pages.

I'm not sure, though, which of those parsers would be easiest to integrate with Lua...
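
For anyone following the thread, the single-stack approach quoted further down, combined with the "table of elements that must be empty" idea from the message above, might look roughly like this in Lua. This is only a sketch under assumptions: the names `parse` and `VOID` and the node fields `tag`, `children` and `open` are made up for illustration, it looks at tags only (text, attributes and comments are ignored), and it is neither Roberto's parser, WDM's code, nor the HTML5 parsing algorithm.

```lua
-- Hypothetical sketch; not taken from WDM or from Roberto's XML parser.
-- Partial table of elements that must be empty (they never get a close tag).
local VOID = { br = true, img = true, input = true, hr = true, meta = true, link = true }

-- Build a tree from tags only; nesting is deferred until a close tag is seen.
local function parse(html)
  local stack = {}  -- flat stack of nodes in document order
  for close, name in html:gmatch("<(/?)([%w]+)[^>]*>") do
    name = name:lower()
    if close == "" then
      -- Open tag: push it; its children are not known yet.
      stack[#stack + 1] = { tag = name, children = {}, open = not VOID[name] }
    else
      -- Close tag: walk down from the top to the nearest still-open match.
      local i = #stack
      while i > 0 and not (stack[i].open and stack[i].tag == name) do
        i = i - 1
      end
      if i > 0 then
        local node = stack[i]
        -- Everything above the match becomes its child, in original order.
        for j = i + 1, #stack do
          node.children[#node.children + 1] = stack[j]
          stack[j] = nil
        end
        node.open = false
      end
      -- A close tag with no matching open tag is silently ignored.
    end
  end
  return stack  -- whatever is left forms the top level of the tree
end
```

With the `<div> ... </div>` example quoted below, this yields a single top-level `div` with five children, and the unclosed `span` gets no children of its own. For `<html><body>Hello world!` nothing is ever closed, so both elements simply stay at the top level.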



> Sent from my iPhone
> 
> On Aug 7, 2011, at 11:06 AM, HyperHacker <hyperhacker@gmail.com> wrote:
> 
>> On Sun, Aug 7, 2011 at 09:55, David Hollander <dhllndr@gmail.com> wrote:
>>> Cool! My solution was just to simplify Roberto's Lua function by
>>> putting all generated nodes on a single stack and deferring nesting
>>> nodes as children until a </close> tag appears, then iterating from
>>> the top of the stack to find the nearest matching open tag. So
>>> 
>>> <div>
>>>   <br>
>>>  <img class=world src = "hello">
>>>  <span id='stuff''>
>>>   <div></div>
>>>   <input type=checkbox checked>
>>> </div>
>>> 
>>> ..would be valid and put all elements as children of the first <div>.
>>> The debate then is if the inner <div> and <input> should instead be
>>> children of the unfinished <span>. My current interpretation is that
>>> they should not, though I'm not sure which error is more common. Would
>>> need to find more poorly programmed websites, googling for *.aspx
>>> might do the trick ;)
>>> 
>>> On Sun, Aug 7, 2011 at 9:13 AM, Michal Kottman <k0mpjut0r@gmail.com> wrote:
>>>> On Sunday, 7 August 2011, David Hollander <dhllndr@gmail.com> wrote:
>>>>>> I use them both in my little web-crawling utility module WDM [1]
>>>>> 
>>>>> I see you are using Roberto's XML parser as a base, which is a strict
>>>>> parser that raises errors on improperly formatted XML?
>>>>> A problem I ran into last week is that the HTML spec is a bit
>>>>> different than XML[1], unless the webpage is specifically using an
>>>>> XHTML doctype, and many websites had html errors on top of that.
>>>> 
>>>> To deal with that issue, you can optionally use the html-tidy binding
>>>> through the toTidy() function. It returns the same table format as toXml(),
>>>> and also tries to clean up the source through html-tidy beforehand. The
>>>> source is at https://github.com/mkottman/tidy/tree/mk in the 'mk' branch.
>>>> 
>>>> WDM stores saved pages locally in a cache directory, so you can experiment
>>>> without downloading things multiple times. These can be compressed if the
>>>> bz2 library is available. You can find it at
>>>> https://github.com/mkottman/lua-bz2 .
>>>> 
>>> 
>>> 
>> 
>> What would it do if it never found a close tag? Say: <html><body>Hello world!
>> 
>> -- 
>> Sent from my toaster.
>> 
>

