- Subject: Re: Stripping HTML tags
- From: Chris Marrin <chris@...>
- Date: Tue, 16 Aug 2005 07:49:05 -0700
Florian Berger wrote:
Chris Marrin wrote:
> You could just use luaexpat and then extract out what you need. This
> is especially easy with the Lua Object Model feature, which simply
> returns the HTML as a hierarchy of tables. Expat is very good at
> grokking all the twisty bits of HTML, so this could help get past all
How well does LuaExpat work if HTML is not clean or valid?
My experience with expat (NOT used with LuaExpat) is that it makes a
valiant effort to deal with a few things. But for the most part, invalid
HTML generates an error and aborts. I think there is a way to get expat
to continue if there is a validity error. For instance, I think you can
get it to handle the case where an EndElement has the wrong name. But
for the most part invalid HTML, like invalid C, is hard to fix and make
any sense of. And I don't know how easy it would be to get LuaExpat to
be tolerant of errors.
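
To make the "generates an error and aborts" behaviour concrete, here is a minimal sketch. It uses Python's standard-library expat binding rather than LuaExpat, purely so that it runs without extra modules; that substitution is an assumption of this example, and `strip_tags` is a hypothetical helper name, not part of either library. It collects character data to strip the tags, and shows expat stopping at a mismatched end tag.

```python
import xml.parsers.expat


def strip_tags(markup):
    """Return the character data of well-formed markup, with tags removed.

    Hypothetical helper for illustration only.
    """
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    # The second argument marks this as the final chunk of input.
    # Parse() raises ExpatError at the first well-formedness error.
    parser.Parse(markup, True)
    return "".join(chunks)


clean = "<p>Hello <b>world</b></p>"
broken = "<p>Hello <b>world</p>"  # </p> arrives while <b> is still open

print(strip_tags(clean))

try:
    strip_tags(broken)
except xml.parsers.expat.ExpatError as err:
    # err.code is expat's numeric error code; ErrorString() describes it.
    print("expat gave up:", xml.parsers.expat.ErrorString(err.code))
```

Typical real-world HTML, such as an unclosed `<br>` or an unquoted attribute value, fails the same way, so a tolerant tag stripper would need either a cleanup pass first or a parser designed for HTML rather than XML.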
My general rule is "always use valid HTML" :-)
chris marrin
email@example.com
"As a general rule, don't solve puzzles that open portals to Hell"