On Thu, May 23, 2013 at 4:20 PM, Daniel Silverstone <dsilvers@digital-scurf.org> wrote:
Parsing HTML is very hard to get right, so you're unlikely to find a pure Lua
version of a parser.

If it's well-formed HTML (and this is often a big 'if'), then Penlight's xml module has an HTML mode:

local xml = require 'pl.xml'
xml.parsehtml = true
local doc = xml.parse(file,true,true) -- true for 'file not string', true for 'pure Lua parser'

The result is then in luaexpat's LOM format; this can be processed using other functionality of pl.xml like pattern matching.
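
As a rough sketch of what working with a LOM document looks like without any helper library: each element is a table with a 'tag' field, an 'attr' table, and its children (strings for text, tables for elements) in the array part. The hand-built 'doc' table and the 'collect' helper below are made up for illustration; in practice 'doc' would come from xml.parse.

```lua
-- Hand-built table in LOM shape (a hypothetical stand-in for the result of xml.parse).
local doc = {
  tag = "html", attr = {},
  { tag = "body", attr = {},
    { tag = "a", attr = { href = "http://www.lua.org" }, "Lua" },
    { tag = "p", attr = {},
      "See ",
      { tag = "a", attr = { href = "http://lua-users.org" }, "lua-users" },
    },
  },
}

-- Hypothetical helper: recursively collect every element with the given tag name.
local function collect(node, tag, out)
  out = out or {}
  if type(node) ~= "table" then return out end  -- text nodes are plain strings
  if node.tag == tag then out[#out + 1] = node end
  for _, child in ipairs(node) do
    collect(child, tag, out)
  end
  return out
end

-- Print the href and link text of every <a> element, in document order.
local links = collect(doc, "a")
for _, a in ipairs(links) do
  print(a.attr.href, a[1])
end
```

Because the structure is just nested tables, a plain recursive walk like this is often enough for simple extraction jobs.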

This mode knows about the HTML elements that can be empty, like <br> (<p> isn't empty!), and about case-insensitivity.
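
To illustrate the idea (this is not Penlight's actual code, just a minimal sketch): a parser can decide whether an element is empty with a case-insensitive lookup in a set of tag names. The list below is the set of void elements from the HTML standard; the 'is_void' helper is a hypothetical name.

```lua
-- Void elements from the HTML standard; an illustration, not Penlight's own list.
local void = {}
for name in ("area base br col embed hr img input link meta param source track wbr"):gmatch("%S+") do
  void[name] = true
end

-- Hypothetical helper: true if the tag never takes content or a closing tag.
-- Lower-casing first makes the check case-insensitive, so <BR> and <br> match.
local function is_void(tag)
  return void[tag:lower()] == true
end

print(is_void("BR"), is_void("p"))
```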

Possibly of marginal use in the crappy world of real-world HTML, but it does manage to cope with real pages that keep to the standard.

(I still don't like the fact that 'parsehtml' is basically a global mode but everything is a work in progress)

steve d.