Re: Web crawling in Lua

lua-l archive


Subject: Re: Web crawling in Lua
From: David Hollander <dhllndr@...>
Date: Sun, 7 Aug 2011 06:43:33 -0500

> I use them both in my little web-crawling utility module WDM [1]

I see you are using Roberto's XML parser as a base. Isn't that a strict
parser that raises errors on improperly formatted XML?
A problem I ran into last week is that HTML syntax differs from XML [1]
unless the page specifically uses an XHTML doctype, and many websites
had HTML errors on top of that. The approach I went with was a
non-strict HTML parser that always tries to attach elements somewhere in
the DOM. I'll put what I have so far on the wiki or GitHub later this
week.

[1] http://en.wikipedia.org/wiki/HTML_element#Syntax
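
For illustration only, here is a minimal sketch of what a non-strict
parser of this kind might look like. It is not the parser mentioned
above; the names parse_html and VOID are hypothetical, and the tag
pattern is a simplification (comments, doctypes and attribute values
containing ">" are not handled and end up as text). Unmatched closing
tags are ignored, and elements left open are still attached to the tree.

```lua
-- Elements that never take a closing tag in HTML (partial list).
local VOID = { br = true, hr = true, img = true, input = true,
               link = true, meta = true }

-- Lenient parse: never raises on malformed markup, always attaches
-- every element and text run somewhere under the root node.
local function parse_html(html)
  local root = { tag = "#root", children = {} }
  local stack = { root }          -- open elements, innermost last
  local pos = 1

  local function add(node)
    local parent = stack[#stack]
    parent.children[#parent.children + 1] = node
  end

  while pos <= #html do
    local s, e, closing, name, rest =
      html:find("<(/?)([%a][%w:-]*)([^>]*)>", pos)
    if not s then
      add(html:sub(pos))          -- trailing text
      break
    end
    if s > pos then add(html:sub(pos, s - 1)) end
    name = name:lower()
    if closing == "/" then
      -- Close the nearest open element with this name, implicitly
      -- closing anything nested inside it. No match: ignore the tag.
      for i = #stack, 2, -1 do
        if stack[i].tag == name then
          for j = #stack, i, -1 do stack[j] = nil end
          break
        end
      end
    else
      local node = { tag = name, attrs = rest, children = {} }
      add(node)
      if not VOID[name] and rest:sub(-1) ~= "/" then
        stack[#stack + 1] = node
      end
    end
    pos = e + 1
  end
  return root
end

local doc = parse_html("<p>one<b>two</i></p><br>three</div>")
```

In this input the stray </i> and </div> are dropped, the unclosed <b> is
closed implicitly by </p>, and <br> is treated as a void element, so
nothing is rejected and nothing is lost from the tree.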

On Sun, Jul 31, 2011 at 11:51 AM, Michal Kottman <k0mpjut0r@gmail.com> wrote:
> On Sun, 2011-07-31 at 14:41 +0200, Dirk Laurie wrote:
>> The libcurl library documentation lists two sets of Lua bindings to curl:
>>
>> Lua
>>
>>   luacurl by Alexander Marinov
>>   http://luacurl.luaforge.net/
>>
>>   Lua-cURL by Jürgen Hötzel
>>   http://luaforge.net/projects/lua-curl/
>>
>> Comments welcome by someone who has experience of either.
>
> Both have a similar interface. I use them both in my little web-crawling
> utility module WDM [1], so you may take a look there.
>
> The differences essentially are:
>
> LuaCurl:
> - binds only the easy interface
> - initialize with curl.new()
> - passes (userparam, string) to WRITEFUNCTION
>
> Lua-cURL:
> - binds also multi/shared API
> - initialize with curl.easy_init()
> - passes only string to WRITEFUNCTION
>
>
> [1] https://github.com/mkottman/wdm/blob/master/wdm.lua
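
The WRITEFUNCTION difference listed above can be hidden behind a single
callback. The following is a rough sketch, not code from WDM:
make_collector and new_handle are hypothetical helper names, and
telling the bindings apart by which constructor exists is an assumption
based only on the two function names given in the list. Returning the
byte count follows libcurl's C convention; whether each binding needs
it has not been checked here.

```lua
-- Build a write callback that accepts both calling conventions:
--   (userparam, string)  and  (string)
local function make_collector()
  local chunks = {}
  local function on_write(a, b)
    -- With two arguments the data is the second; with one, the first.
    local data = (b ~= nil) and b or a
    chunks[#chunks + 1] = data
    return #data                  -- bytes handled, as libcurl expects in C
  end
  local function body()
    return table.concat(chunks)
  end
  return on_write, body
end

-- Hypothetical helper: create an easy handle with whichever
-- constructor the loaded binding provides.
local function new_handle(curl)
  if curl.easy_init then return curl.easy_init(), "Lua-cURL" end
  if curl.new then return curl.new(), "LuaCurl" end
  error("unrecognised curl binding")
end
```

The callback returned by make_collector can then be registered with
either binding's own option-setting call, and body() yields the
concatenated response once the transfer has finished.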
