On Sunday, 7 August 2011, David Hollander wrote:
>> I use them both in my little web-crawling utility module WDM [1]
> I see you are using Roberto's XML parser as a base, which is a strict
> parser that raises errors on improperly formatted XML?
> A problem I ran into last week is that the HTML spec is a bit
> different than XML[1], unless the webpage is specifically using an
> XHTML doctype, and many websites had html errors on top of that.

To deal with that issue, you can optionally use the html-tidy binding through the toTidy() function. It returns the same table format as toXml(), but first tries to clean up the source with html-tidy. The source is in the 'mk' branch.

WDM stores saved pages locally in a cache directory, so you can experiment without downloading things multiple times. These can be compressed if the bz2 library is available. You can find it at .
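
As a rough illustration of the caching idea (this is not WDM's actual code, and every name in it is made up for the example), a page cache in plain Lua could look something like the sketch below. It assumes the cache directory already exists and that you pass in your own fetch function; compression is left out to keep it dependency-free.

```lua
-- Hypothetical sketch of a local page cache; none of these names come from WDM.
local Cache = {}
Cache.__index = Cache

-- dir must already exist (plain Lua has no portable mkdir).
-- fetch is any function that takes a URL and returns the page body as a string.
function Cache.new(dir, fetch)
  return setmetatable({ dir = dir, fetch = fetch }, Cache)
end

-- Turn a URL into a safe file name by escaping every non-alphanumeric byte.
local function cache_name(url)
  return (url:gsub("[^%w]", function(c)
    return string.format("_%02X", c:byte())
  end))
end

function Cache:path(url)
  return self.dir .. "/" .. cache_name(url)
end

-- Return the page body, reading the cached copy when one exists.
-- The second return value is true when the body came from the cache.
function Cache:get(url)
  local path = self:path(url)
  local f = io.open(path, "rb")
  if f then
    local body = f:read("*a")
    f:close()
    return body, true
  end
  -- Cache miss: fetch once, then save the body for later runs.
  local body = assert(self.fetch(url))
  f = assert(io.open(path, "wb"))
  f:write(body)
  f:close()
  return body, false
end
```

You would create it with something like `Cache.new("cache", my_fetch)` and call `cache:get(url)` wherever you would otherwise download the page. Very long URLs could run into file name length limits with this naming scheme, so a real implementation would probably hash the URL instead.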