- Subject: Re: Stripping HTML tags
- From: Chris Marrin <chris@...>
- Date: Mon, 15 Aug 2005 14:34:20 -0700
Rici Lake wrote:
> (Please don't reply to messages when you're starting a new thread. It's
> confusing.)
>
> On 15-Aug-05, at 2:44 PM, Florian Berger wrote:
>> I thought that stripping HTML tags was easy until I saw something like
>> this:
>> <a href="http://www.example.com" alt="> example"> example </a>
>
> That would be non-trivial to handle with a regular expression, although
> I think it is possible.
> However, you would have quite a bit of trouble with some other
> legitimate HTML constructions, particularly comments (<!-- I left out
> the <p> tag here -->) and embedded javascript. If you want a
> bullet-proof html parser, you should probably use a tokenizer.
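
To make the tokenizer suggestion concrete, here is a minimal sketch. It is written in Python rather than Lua, purely because Python ships an HTML tokenizer (html.parser) in its standard library; the names TextExtractor and strip_tags are made up for this illustration. It is one possible approach under those assumptions, not a complete HTML parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags, comments, and script/style bodies.

    Hypothetical helper for illustration; not part of any library.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments seen so far
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # The tokenizer has already dealt with quoted attribute values,
        # so a '>' inside alt="..." does not end the tag early.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment,
        # which is left at its default (do nothing).
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the text of `html` with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With this sketch, the anchor tag quoted above comes out as " example ", and the comment example contributes no text at all.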
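You could just use luaexpat and then extract what you need. This is
especially easy with the Lua Object Model (LOM) feature, which simply
returns the document as a hierarchy of tables. Expat is an XML parser, so
the input has to be well-formed (XHTML rather than arbitrary HTML), but
within that limit it is very good at grokking the twisty bits, such as
quoted '>' characters and comments, so this could help get past all that...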
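
As a rough illustration of the parse-into-a-tree idea, the sketch below uses Python's xml.parsers.expat module, which wraps the same Expat library, in place of luaexpat. The node layout (a dict with "tag", "attr" and "children") is only loosely modelled on what LOM returns, and parse_to_tree and text_of are hypothetical names. It assumes well-formed input; Expat raises an error on anything else.

```python
import xml.parsers.expat


def parse_to_tree(markup):
    """Parse well-formed markup into nested dicts (hypothetical LOM-like layout)."""
    root = {"tag": None, "attr": {}, "children": []}
    stack = [root]  # stack[-1] is the element currently being filled

    def start(name, attrs):
        node = {"tag": name, "attr": attrs, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def text(data):
        # Character data may arrive in several chunks; keep them all.
        stack[-1]["children"].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    # No CommentHandler is set, so comments are silently dropped.
    parser.Parse(markup, True)
    return root["children"][0]


def text_of(node):
    """Concatenate all text beneath a node, ignoring the markup."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node["children"])


tree = parse_to_tree(
    '<a href="http://www.example.com" alt="> example"> example </a>'
)
```

Here tree["attr"]["alt"] holds the "> example" string intact, and text_of(tree) gives just the link text, so stripping tags becomes a walk over the tables.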
--
chris marrin
chris@marrin.com
"As a general rule, don't solve puzzles that open portals to Hell"