lua-l archive
Subject: Re: HTML Parser Recommendation
From: steve donovan <steve.j.donovan@...>
Date: Fri, 24 May 2013 09:53:27 +0200
On Thu, May 23, 2013 at 4:20 PM, Daniel Silverstone <dsilvers@digital-scurf.org> wrote:
> Parsing HTML is very hard to get right, so you're unlikely to find a pure Lua
> version of a parser.
If it's well-formed HTML (and this is often a big 'if') then Penlight's xml module has an HTML mode:
    local xml = require 'pl.xml'
    xml.parsehtml = true
    -- second argument: true for 'file not string'; third argument: true for 'pure Lua parser'
    local doc = xml.parse(file, true, true)
The result is then in luaexpat's LOM format; this can be processed using other functionality of pl.xml like pattern matching.
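To give a feel for what working with a LOM document looks like, here is a minimal sketch that walks such a tree by hand. It does not call Penlight at all: the `doc` table is a hand-written stand-in for what a parser might return, and `collect_attr` is a hypothetical helper name, not part of pl.xml. It relies only on the LOM convention that an element is a table with a `tag` field, an `attr` table, and its children (strings for text, tables for elements) in the array part.

```lua
-- Hand-written stand-in for a parsed document, roughly:
-- <p>See <a href="http://www.lua.org">Lua</a><br/>and <a href="#top">top</a></p>
local doc = {
  tag = "p", attr = {},
  "See ",
  { tag = "a", attr = { "href", href = "http://www.lua.org" }, "Lua" },
  { tag = "br", attr = {} },
  "and ",
  { tag = "a", attr = { "href", href = "#top" }, "top" },
}

-- Depth-first walk: collect the value of attr_name on every tag_name element.
-- (collect_attr is an illustrative helper, not a pl.xml function.)
local function collect_attr(node, tag_name, attr_name, out)
  out = out or {}
  if type(node) ~= "table" then return out end  -- text nodes are plain strings
  if node.tag == tag_name and node.attr and node.attr[attr_name] then
    out[#out + 1] = node.attr[attr_name]
  end
  for _, child in ipairs(node) do
    collect_attr(child, tag_name, attr_name, out)
  end
  return out
end

local links = collect_attr(doc, "a", "href")
for _, href in ipairs(links) do print(href) end
```

Because the tree is just nested Lua tables, the same function should work unchanged on whatever a LOM-producing parser returns.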
This mode knows about the HTML elements that can be empty, like <br> (<p> isn't empty!), and about case-insensitivity.
Possibly of marginal use in the crappy world of real-world HTML, but it does manage to cope with real pages that keep to the standard.
(I still don't like the fact that 'parsehtml' is basically a global mode, but everything is a work in progress.)
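One way to keep a module-level flag like this from leaking into unrelated code is to set it only for the duration of a single call. The sketch below is an assumption about how one might do that, not something Penlight provides: `with_flag` is a hypothetical helper, and `fake_xml` is a tiny stand-in module used so the example runs without pl.xml installed.

```lua
local unpack = table.unpack or unpack  -- Lua 5.1 / 5.2+ compatibility

local function pack(...)
  return { n = select("#", ...), ... }
end

-- Temporarily set tbl[key] = value, call fn(...), then restore the old value,
-- even if fn raises an error. (with_flag is an illustrative helper.)
local function with_flag(tbl, key, value, fn, ...)
  local old = tbl[key]
  tbl[key] = value
  local results = pack(pcall(fn, ...))
  tbl[key] = old
  if not results[1] then error(results[2], 0) end
  return unpack(results, 2, results.n)
end

-- Stand-in for a module with a global mode flag; NOT the real pl.xml.
local fake_xml = { parsehtml = false }
function fake_xml.parse(s)
  return fake_xml.parsehtml and "html" or "xml"
end

local mode = with_flag(fake_xml, "parsehtml", true, fake_xml.parse, "<br>")
print(mode, fake_xml.parsehtml)
```

With the real module the call would presumably look like `with_flag(xml, 'parsehtml', true, xml.parse, file, true, true)`, so the flag is back to its previous value as soon as the parse returns.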
steve d.