Rici Lake wrote:
On 15-Aug-05, at 2:44 PM, Florian Berger wrote:

I thought that stripping HTML tags was easy until I saw something like this:
<a href=""; alt="> example"> example </a>

That would be non-trivial to handle with a regular expression, although I think it is possible.

However, you would have quite a bit of trouble with some other legitimate HTML constructions, particularly comments (<!-- I left out the <p> tag here -->) and embedded JavaScript. If you want a bullet-proof HTML parser, you should probably use a tokenizer.
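
To make the tokenizer idea concrete, here is a minimal sketch of a hand-written scanner, not a full HTML parser. It is written in Python purely for illustration, and the function name strip_tags is made up for this example. It tracks just enough state to ignore a ">" inside a quoted attribute value and to skip comments whole. It assumes every "<" starts a tag, and it does not treat script or style bodies specially or decode entities.

```python
def strip_tags(html: str) -> str:
    """Return the text of `html` with tags and comments removed (sketch only)."""
    out = []
    i, n = 0, len(html)
    while i < n:
        if html.startswith("<!--", i):
            # Comment: skip to the closing "-->", or to the end if unterminated.
            end = html.find("-->", i + 4)
            i = n if end == -1 else end + 3
        elif html[i] == "<":
            # Tag: scan to the closing ">", ignoring any ">" inside quotes.
            i += 1
            quote = None
            while i < n:
                ch = html[i]
                if quote:
                    if ch == quote:
                        quote = None      # closing quote of an attribute value
                elif ch in "\"'":
                    quote = ch            # opening quote of an attribute value
                elif ch == ">":
                    break                 # real end of the tag
                i += 1
            i += 1                        # step past the ">"
        else:
            out.append(html[i])           # ordinary text character
            i += 1
    return "".join(out)


print(repr(strip_tags('<a href=""; alt="> example"> example </a>')))
print(repr(strip_tags('a<!-- I left out the <p> tag here -->b')))
```

Both troublesome inputs from this thread come out as plain text. A stray "<" in running text (as in "a < b") would still be misread as a tag, which is one reason a real tokenizer needs more states than this.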

You could just use luaexpat and then extract what you need. This is especially easy with its Lua Object Model (LOM) module, which simply returns the document as a hierarchy of tables. Expat is an XML parser, so the page has to be well-formed (XHTML, say) for this to work, but when it is, Expat is very good at grokking all the twisty bits of the markup, so this could help get past all that...
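
As a rough illustration of the parse-into-a-tree approach, the sketch below uses Python's standard-library binding to the same Expat library (xml.parsers.expat) as a stand-in for luaexpat. The function names and the tag/attr/children layout are assumptions made for this example and only loosely resemble what LOM returns. Input must be well-formed XML, otherwise Expat raises an error.

```python
import xml.parsers.expat


def parse_to_tree(markup: str) -> dict:
    """Parse well-formed markup into nested dicts (hypothetical LOM-like layout)."""
    root = {"tag": None, "attr": {}, "children": []}
    stack = [root]                      # stack[-1] is the element being filled

    def start(name, attrs):
        node = {"tag": name, "attr": attrs, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def text(data):
        stack[-1]["children"].append(data)   # text may arrive in several pieces

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(markup, True)          # True: this is the final chunk of input
    return root["children"][0]


def text_of(node) -> str:
    """Concatenate all character data below `node`."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node["children"])


doc = '<p><a href="" alt="> example"> example </a><!-- I left out the <p> tag here --></p>'
tree = parse_to_tree(doc)
print(tree["tag"], repr(text_of(tree)))
```

No comment handler is registered, so the comment is simply dropped, and the ">" inside the attribute value causes no trouble because Expat does the tokenizing.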

