lua-l archive


Florian Berger wrote:
I thought that stripping HTML tags was easy until I saw something like this:
<a href="http://www.example.com" alt="> example"> example </a>

I believe this is not correct HTML, even by the old pre-XHTML standards: it should be alt="&gt; example". I am not sure whether it is valid XML, although I believe XML is stricter about escaping & and < than about >.

Of course, the problem is that browsers are quite tolerant of this kind of error (IMHO, they should have been stricter to start with; the Web would be much cleaner...), so you are likely to find these constructs in real pages.

My code was:
local s = '<a href="http://www.example.com" alt="> example"> example </a>'
-- '<.->' is the shortest match, so it stops at the first '>',
-- which here sits inside the quoted alt value
s = string.gsub(s, '<.->', ' ')
print(s)

-> example"> example


I have seen some examples using PHP and regular expressions. Programming in Lua 20.1 says that Lua patterns cannot do everything a POSIX implementation does (http://www.lua.org/pil/20.1.html). Can this be done in Lua? All that comes to mind are captures, but I'm not sure whether they help at all. Of course, my example works in most cases, but it would be nice to have it work even better.

Note that there is a PCRE wrapper for Lua if you need more powerful regular expressions.
That is, of course, if you are not restricted to standard Lua.
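
If you do have to stay with standard Lua, one possibility is to give up on a single pattern and walk the string by hand, remembering whether you are inside a quoted attribute value. The following is only a minimal sketch, assuming Lua 5.1 or later (it uses the # operator); strip_tags is a made-up name, not something from a library. It treats every < as the start of a tag, so a raw < in the text will confuse it, and it knows nothing about comments or <script> blocks.

```lua
-- Hypothetical helper: replace each tag with a space, ignoring any '>'
-- that appears inside a single- or double-quoted attribute value.
local function strip_tags(s)
  local out = {}
  local pos = 1
  while true do
    -- plain find (4th argument true): no pattern matching involved
    local lt = string.find(s, '<', pos, true)
    if not lt then
      out[#out + 1] = string.sub(s, pos)   -- remaining text, no more tags
      break
    end
    out[#out + 1] = string.sub(s, pos, lt - 1)  -- text before the tag

    -- scan forward to the first '>' that is not inside quotes
    local quote, close = nil, nil
    local i = lt + 1
    while i <= #s do
      local c = string.sub(s, i, i)
      if quote then
        if c == quote then quote = nil end      -- closing quote
      elseif c == '"' or c == "'" then
        quote = c                               -- opening quote
      elseif c == '>' then
        close = i
        break
      end
      i = i + 1
    end

    if not close then
      out[#out + 1] = string.sub(s, lt)  -- unterminated tag: keep as text
      break
    end
    out[#out + 1] = ' '
    pos = close + 1
  end
  return table.concat(out)
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
```

With the example from the question, the > inside alt="..." no longer ends the tag, so only the link text and the replacement spaces remain.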

Looking at other answers, I don't think an XML parser would do the job. Regular HTML isn't even XML compliant, so an XML parser would complain about unclosed tags like <br> or <p>.

A full tokeniser/parser could be a solution, though perhaps too costly for your needs... HTML isn't really easy to parse, even more so if you have to be as tolerant of errors as the browsers are (like the one above, accepting -- in comments, raw & in text, etc.).

--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --