Rici Lake wrote:
(Please don't reply to messages when you're starting a new thread. It's confusing.)

On 15-Aug-05, at 2:44 PM, Florian Berger wrote:

I thought that stripping HTML tags was easy until I saw something like this:
<a href="http://www.example.com" alt="> example"> example </a>


That would be non-trivial to handle with a regular expression, although I think it is possible.
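To make the problem concrete, here is a minimal sketch of what the obvious Lua pattern does with that input. The function name naive_strip and the sample string are just illustrations, not anything from the original post.

```lua
-- Naive approach: delete everything from "<" up to the next ">".
-- It knows nothing about quoted attribute values.
local function naive_strip(html)
  -- The extra parentheses drop gsub's second return value (the match count).
  return (html:gsub("<[^>]*>", ""))
end

local sample = '<a href="http://www.example.com" alt="> example"> example </a>'

-- The first ">" sits inside the alt attribute, so the match ends too early
-- and part of the attribute leaks into the output.
print(naive_strip(sample))  --> ' example"> example ' (with the stray quote and ">")
```

Lua patterns have no alternation, so teaching a single pattern about both quote styles gets awkward quickly.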

However, you would have quite a bit of trouble with some other legitimate HTML constructions, particularly comments (<!-- I left out the <p> tag here -->) and embedded JavaScript. If you want a bullet-proof HTML parser, you should probably use a tokenizer.
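As a rough sketch of the tokenizer idea, and not a complete HTML parser, the following plain-Lua function scans character by character, tracks whether it is inside a quoted attribute value, and skips comments as a unit. The name strip_tags is hypothetical. It deliberately ignores script and style content, CDATA sections, and character entities, all of which a real tokenizer would need to handle.

```lua
-- Strip tags and comments from an HTML string with a small hand-written scanner.
local function strip_tags(html)
  local out = {}
  local pos, len = 1, #html
  while pos <= len do
    local lt = html:find("<", pos, true)      -- plain find, no pattern magic
    if not lt then
      out[#out + 1] = html:sub(pos)           -- trailing text after the last tag
      break
    end
    out[#out + 1] = html:sub(pos, lt - 1)     -- text before the tag
    if html:sub(lt, lt + 3) == "<!--" then
      -- Comment: skip to the closing "-->", or to end of input if unterminated.
      local _, ce = html:find("-->", lt + 4, true)
      pos = (ce or len) + 1
    else
      -- Tag: find the closing ">" that is not inside a quoted attribute value.
      local i, quote = lt + 1, nil
      while i <= len do
        local c = html:sub(i, i)
        if quote then
          if c == quote then quote = nil end  -- closing quote
        elseif c == '"' or c == "'" then
          quote = c                           -- opening quote
        elseif c == ">" then
          break
        end
        i = i + 1
      end
      pos = i + 1
    end
  end
  return table.concat(out)
end

print(strip_tags('<a href="http://www.example.com" alt="> example"> example </a>'))
print(strip_tags('a<!-- I left out the <p> tag here -->b'))
```

With the two inputs above, the quoted ">" no longer ends the tag early and the "<p>" inside the comment is skipped along with the rest of the comment.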

You could just use LuaExpat and then extract what you need. This is especially easy with its Lua Object Model (LOM) feature, which simply returns the document as a hierarchy of tables. Expat is an XML parser, so this works when the HTML is well-formed XML (XHTML, for instance); in that case it handles quoted attributes and comments for you, which could help get past all that...

--
chris marrin                ,""$,
chris@marrin.com          b`    $                             ,,.
                        mP     b'                            , 1$'
        ,.`           ,b`    ,`                              :$$'
     ,|`             mP    ,`                                       ,mm
   ,b"              b"   ,`            ,mm      m$$    ,m         ,`P$$
  m$`             ,b`  .` ,mm        ,'|$P   ,|"1$`  ,b$P       ,`  :$1
 b$`             ,$: :,`` |$$      ,`   $$` ,|` ,$$,,`"$$     .`    :$|
b$|            _m$`,:`    :$1   ,`     ,$Pm|`    `    :$$,..;"'     |$:
P$b,      _;b$$b$1"       |$$ ,`      ,$$"             ``'          $$
 ```"```'"    `"`         `""`        ""`                          ,P`
"As a general rule, don't solve puzzles that open portals to Hell"