


Thanks for the comments and tips.

Roberto Ierusalimschy wrote:
> <a href="http://www.example.com" alt="> example"> example </a>
> Maybe you could preprocess the string, finding all substrings inside
> quotes and escaping "dangerous" characters to something else;
> something like this (untested):

Interesting idea, that might be something to try.
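
A minimal sketch of that preprocessing idea, for illustration only. This is not the code from the quoted message (which is not reproduced here); the helper names protect_quoted, strip_tags and restore are made up, and it assumes double-quoted attribute values, balanced quotes, and input that does not already contain the bytes \1 and \2:

```lua
-- Hypothetical helper: inside every "..." substring, replace '<' and '>'
-- with placeholder bytes so a naive tag pattern cannot trip over them.
local function protect_quoted(s)
  return (s:gsub('"[^"]*"', function(q)
    return (q:gsub("[<>]", { ["<"] = "\1", [">"] = "\2" }))
  end))
end

-- Naive tag removal: everything from '<' up to the next '>'.
local function strip_tags(s)
  return (s:gsub("<[^>]*>", ""))
end

-- Put the original characters back in whatever quoted text survived.
local function restore(s)
  return (s:gsub("[\1\2]", { ["\1"] = "<", ["\2"] = ">" }))
end

local input = '<a href="http://www.example.com" alt="> example"> example </a>'
print(restore(strip_tags(protect_quoted(input))))
```

Without the protect_quoted step, the '>' inside the alt value would end the tag early and leave part of the attribute behind in the output.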

Rici Lake wrote:
> However, you would have quite a bit of trouble with some other
> legitimate HTML constructions, particularly comments (<!-- I left out
> the <p> tag here -->) and embedded javascript. If you want a
> bullet-proof html parser, you should probably use a tokenizer.

I thought about that a bit, and I think the right order is to remove scripts and comments first, and only then remove the other tags.
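
A rough sketch of that ordering with plain Lua patterns, assuming reasonably well-formed input. The function name strip_html is hypothetical, and this is not a substitute for a real tokenizer: a '>' inside a quoted attribute, an unterminated comment, or a literal "</script>" inside a script string would still confuse it.

```lua
-- Hypothetical helper: remove scripts, then comments, then all other tags.
local function strip_html(html)
  local s = html
  -- 1. <script ...> ... </script>, contents included. Lua patterns have no
  --    case-insensitive flag, so each letter is spelled out as a class.
  s = s:gsub("<[sS][cC][rR][iI][pP][tT][^>]*>.-</[sS][cC][rR][iI][pP][tT]%s*>", "")
  -- 2. Comments: "<!--" up to the first "-->" ('-' must be escaped as %-).
  s = s:gsub("<!%-%-.-%-%->", "")
  -- 3. Whatever tags are left.
  s = s:gsub("<[^>]*>", "")
  return s
end

local page = '<p>Hello<!-- I left out the <p> tag here --> '
  .. '<script type="text/javascript">if (a > b) { x = "<b>"; }</script>'
  .. 'world</p>'
print(strip_html(page))
```

Doing step 3 first would cut the comment and the script body at their first '>' and leave fragments of both in the output, which is why the scripts and comments go first.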

Chris Marring wrote:
> You could just use luaexpat and then extract out what you need. This
> is especially easy with the Lua Object Model feature, which simply
> returns the HTML as a hierarchy of tables. Expat is very good at
> grokking all the twisty bits of HTML, so this could help get past all
> that...

How well does LuaExpat work if HTML is not clean or valid?

f