- Subject: Re: Any html scraping libraries?
- From: Michal Kottman <k0mpjut0r@...>
- Date: Sun, 10 Apr 2011 19:31:12 +0200
On Sat, 2011-04-09 at 00:35 +0400, Alexander Gladysh wrote:
> Hi, list!
>
> I'm looking for a Lua module to scrape some data from a (possibly
> broken) HTML page.
>
> Any usable ones out there?
Also, I started toying around with the QtWebKit module, which is
available through lqt [1]. In previous versions of Qt (4.5 and earlier)
it couldn't actually do much more than load and display a page.
Starting with 4.6, the QtWebKit API [2] offers full DOM access to
elements of a web page you are loading. The API supports CSS 2
selectors, so you can easily get to the elements you are interested in.
Right now I am playing with modifying a page that is loaded from the
web, something like GreaseMonkey, just implemented in Lua :)
In case you only need the "robust HTML source" (as robust as WebKit is),
you can use the following code:
require 'qtcore'
require 'qtgui'
require 'qtwebkit'
local A = QApplication(select('#',...)+1, {...})
local W = QWebView()
-- will be called when the page load is finished
W:__addmethod('process(bool)', function(self, ok)
    -- get the source as QString
    local sourceString = self:page():mainFrame():toHtml()
    -- convert it to Lua string
    local source = sourceString:toUtf8()
    print(source) -- or process as needed...
    A.quit() -- quit the event loop
end)
W:connect('2loadFinished(bool)', W, '1process(bool)')
W:setUrl(QUrl('http://www.lua.org/'))
A.exec()
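If you want specific elements rather than the whole source, the DOM access mentioned above could look roughly like the following. This is an untested sketch. It assumes that lqt exposes QWebFrame's findAllElements() and the QWebElementCollection / QWebElement methods (count, at, attribute, toPlainText) under the same names as the Qt 4.6 C++ API, and that a Lua string is accepted wherever a QString is expected, as with QUrl above. The slot name listLinks and the selector 'a[href]' are only illustrations.

```lua
require 'qtcore'
require 'qtgui'
require 'qtwebkit'

local A = QApplication(select('#',...)+1, {...})
local W = QWebView()

-- will be called when the page load is finished
-- ('listLinks' is a made-up slot name for this example)
W:__addmethod('listLinks(bool)', function(self, ok)
    if ok then
        -- run a CSS selector query against the loaded document;
        -- assumed to return a QWebElementCollection as in C++
        local links = self:page():mainFrame():findAllElements('a[href]')
        -- the collection is indexed from 0, following the C++ API
        for i = 0, links:count() - 1 do
            local link = links:at(i)
            -- attribute() and toPlainText() return QStrings,
            -- so convert them to Lua strings before printing
            local href = link:attribute('href'):toUtf8()
            local text = link:toPlainText():toUtf8()
            print(href, text)
        end
    end
    A.quit() -- quit the event loop
end)

W:connect('2loadFinished(bool)', W, '1listLinks(bool)')
W:setUrl(QUrl('http://www.lua.org/'))
A.exec()
```

The same pattern should work for any other selector, for example 'table td' or 'div.content p', as long as the selector is one that WebKit understands.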
[1] https://github.com/mkottman/lqt
[2] http://doc.qt.nokia.com/4.6/qtwebkit.html