- Subject: Re: Stripping HTML tags
- From: Rici Lake <lua@...>
- Date: Mon, 15 Aug 2005 15:16:32 -0500
(Please don't reply to messages when you're starting a new thread. It's
confusing.)
On 15-Aug-05, at 2:44 PM, Florian Berger wrote:
> I thought that stripping HTML tags was easy until I saw something like
> this:
> <a href="http://www.example.com" alt="> example"> example </a>
That would be non-trivial to handle with a regular expression, although
I think it is possible.
However, you would have quite a bit of trouble with some other
legitimate HTML constructions, particularly comments (<!-- I left out
the <p> tag here -->) and embedded JavaScript. If you want a
bullet-proof HTML parser, you should probably use a tokenizer.
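As a rough illustration of the tokenizer idea, here is a minimal sketch of a hand-written scanner. The name strip_tags is hypothetical, and the function only covers quoted attribute values and <!-- --> comments. It makes no attempt at script or style bodies, CDATA, or malformed markup, so it is nowhere near bullet-proof.

```lua
-- Hypothetical helper: strip tags with a small scanner instead of a pattern.
-- Assumption: only quoted attribute values and <!-- --> comments need
-- special treatment; <script>/<style> bodies and CDATA are NOT handled.
local function strip_tags(s)
  local out = {}
  local pos = 1
  local len = #s
  while pos <= len do
    -- Copy plain text up to the next '<' (plain find, no pattern matching).
    local lt = string.find(s, "<", pos, true)
    if not lt then
      out[#out + 1] = string.sub(s, pos)
      break
    end
    out[#out + 1] = string.sub(s, pos, lt - 1)
    if string.sub(s, lt, lt + 3) == "<!--" then
      -- Comment: skip to the closing "-->", or to the end if unterminated.
      local _, close = string.find(s, "-->", lt + 4, true)
      pos = (close or len) + 1
    else
      -- Tag: walk forward, treating quoted attribute values as opaque,
      -- so a '>' inside quotes does not end the tag.
      local i = lt + 1
      local quote = nil
      while i <= len do
        local c = string.sub(s, i, i)
        if quote then
          if c == quote then quote = nil end
        elseif c == '"' or c == "'" then
          quote = c
        elseif c == ">" then
          break
        end
        i = i + 1
      end
      pos = i + 1
    end
  end
  return table.concat(out)
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
print(strip_tags('a<!-- I left out the <p> tag here -->b'))
```

The sketch removes each tag without putting anything in its place.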
> s = string.gsub(s, '<.->', ' ')
The following might prove to be a bit faster, but it would fare no better with
the alt=">.." example:
s = string.gsub(s, "%b<>", " ")
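To make the failure concrete, here is a small sketch that runs both patterns over the alt="> example" line. The variable names are only for illustration. Both patterns close the tag at the first '>', which sits inside the attribute value, so the rest of the attribute leaks into the output.

```lua
local s = '<a href="http://www.example.com" alt="> example"> example </a>'

-- Non-greedy match: stops at the first '>', which is inside the alt value.
local lazy = string.gsub(s, "<.->", " ")

-- Balanced match: with no nested '<', %b<> also closes at the first '>'.
local balanced = string.gsub(s, "%b<>", " ")

print(lazy)      -- the stray  example">  is left behind
print(balanced)  -- same result as the non-greedy pattern
```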
I'm not convinced by the substitution of a tag with a space, though.
The following sequence is not rendered with a space:
<b>over</b>-specified
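A quick sketch of the difference, using the %b<> pattern from above (the variable names are only for illustration):

```lua
local s = "<b>over</b>-specified"

-- Replacing each tag with a space splits the word apart.
local spaced = string.gsub(s, "%b<>", " ")

-- Replacing each tag with the empty string keeps the word intact.
local joined = string.gsub(s, "%b<>", "")

print(spaced)
print(joined)
```

Removing tags outright has its own cost: adjacent block-level elements such as <p>one</p><p>two</p> would run together as "onetwo", so the right replacement probably depends on which tag is being removed.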