Cool! My solution was to simplify Roberto's Lua function: push every
generated node onto a single stack and defer nesting until a closing
tag appears. At that point, walk down from the top of the stack to the
matching opening tag and attach the nodes above it as its children. So

<div>
    <br>
   <img class=world src = "hello">
   <span id='stuff''>
    <div></div>
    <input type=checkbox checked>
</div>

..would be valid and put all elements as children of the first <div>.
The open question is whether the inner <div> and <input> should instead
be children of the unclosed <span>. My current interpretation is that
they should not, though I'm not sure which error is more common. I would
need to find more poorly written websites; googling for *.aspx
might do the trick ;)
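
For illustration, here is a minimal sketch of the single-stack idea. It is
not Roberto's function or my actual code: the name parse_lenient and the
node fields (tag, children, open) are made up for this example, attributes
are skipped rather than parsed, and an attribute value containing ">"
would confuse the pattern.

```lua
-- Lenient single-stack sketch (hypothetical names, illustration only).
local function parse_lenient(html)
  local stack = {}  -- flat list of every node generated so far
  -- Capture the optional "/" and the tag name; skip any attributes.
  for slash, name in html:gmatch("<(/?)(%w+)[^>]*>") do
    if slash == "" then
      -- Opening tag: push it, nesting is decided later.
      stack[#stack + 1] = { tag = name, children = {}, open = true }
    else
      -- Closing tag: walk down from the top to the matching open node.
      for i = #stack, 1, -1 do
        local node = stack[i]
        if node.open and node.tag == name then
          -- Everything above it becomes its children, in document order.
          for j = i + 1, #stack do
            node.children[#node.children + 1] = stack[j]
          end
          for j = #stack, i + 1, -1 do
            stack[j] = nil
          end
          node.open = false
          break
        end
      end
      -- A closing tag with no matching open node is silently ignored.
    end
  end
  return stack  -- whatever is left on the stack are the top-level nodes
end
```

Run on the snippet above, this leaves a single <div> on the stack with
five children (<br>, <img>, <span>, the inner <div> and <input>), and the
unclosed <span> ends up with no children of its own.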

On Sun, Aug 7, 2011 at 9:13 AM, Michal Kottman <k0mpjut0r@gmail.com> wrote:
> On Sunday, 7 August 2011, David Hollander <dhllndr@gmail.com> wrote:
>>> I use them both in my little web-crawling utility module WDM [1]
>>
>> I see you are using Roberto's XML parser as a base, which is a strict
>> parser that raises errors on improperly formatted XML?
>> A problem I ran into last week is that the HTML spec is a bit
>> different than XML[1], unless the webpage is specifically using an
>> XHTML doctype, and many websites had html errors on top of that.
>
> To deal with that issue, you can optionally use the html-tidy binding
> through the toTidy() function. It returns the same table format as toXml(),
> and also tries to clean up the source through html-tidy beforehand. The
> source is at https://github.com/mkottman/tidy/tree/mk in the 'mk' branch.
>
> WDM stores saved pages locally in a cache directory, so you can experiment
> without downloading things multiple times. These can be compressed if the
> bz2 library is available. You can find it at
> https://github.com/mkottman/lua-bz2 .
>