lua-l archive

Florian Berger wrote:

Chris Marrin wrote:
 > You could just use luaexpat and then extract out what you need. This
 > is especially easy with the Lua Object Model feature, which simply
 > returns the HTML as a hierarchy of tables. Expat is very good at
 > grokking all the twisty bits of HTML, so this could help get past all
 > that...

How well does LuaExpat work if HTML is not clean or valid?

My experience with expat (NOT used with LuaExpat) is that it makes a valiant effort to deal with a few things, but for the most part invalid HTML generates an error and aborts the parse. I think there is a way to get expat to continue after a validity error. For instance, I think you can get it to handle the case where an EndElement has the wrong name. But invalid HTML, like invalid C, is generally hard to fix and make any sense of. And I don't know how easy it would be to get LuaExpat to be tolerant of errors.
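To illustrate the abort-on-error behaviour, here is a minimal sketch using Python's standard xml.parsers.expat module, which wraps the same expat library. It is a stand-in, not LuaExpat, and the helper name try_parse and the sample markup are made up for illustration.

```python
import xml.parsers.expat


def try_parse(markup):
    """Return None if expat accepts the markup, else a short error string.

    try_parse is a hypothetical helper written for this example only.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this chunk as the final one.
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as err:
        # expat stops at the first well-formedness error it meets.
        reason = xml.parsers.expat.ErrorString(err.code)
        return "line %d, column %d: %s" % (err.lineno, err.offset, reason)
    return None


if __name__ == "__main__":
    samples = [
        # Well-formed markup: every element is closed in order.
        "<html><body><p>hi</p></body></html>",
        # Typical tag soup: <br> and <p> are never closed.
        "<html><body><p>one<br><p>two</body></html>",
    ]
    for markup in samples:
        print(markup, "->", try_parse(markup) or "parsed OK")
```

The second sample is the sort of HTML that browsers accept without complaint, yet the parser gives up at the first end tag that does not match the innermost open element.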

My general rule is "always use valid HTML" :-)

chris marrin              ,""$, "As a general rule,don't solve puzzles        b`    $  that open portals to Hell" ,,.
        ,.`           ,b`    ,`                            , 1$'
     ,|`             mP    ,`                              :$$'     ,mm
   ,b"              b"   ,`            ,mm      m$$    ,m         ,`P$$
  m$`             ,b`  .` ,mm        ,'|$P   ,|"1$`  ,b$P       ,`  :$1
 b$`             ,$: :,`` |$$      ,`   $$` ,|` ,$$,,`"$$     .`    :$|
b$|            _m$`,:`    :$1   ,`     ,$Pm|`    `    :$$,..;"'     |$:
P$b,      _;b$$b$1"       |$$ ,`      ,$$"             ``'          $$
 ```"```'"    `"`         `""`        ""`                          ,P`