> I was wondering if you are aware of an efficient way to parse a big
> compressed XML file with Lua (or LuaJIT).
> The files store the wikipedia dumps:
> http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name
> pages-meta-history).
> The parsing consists in extracting the timestamp of each of the
> versions for a page in the dump.

Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.

    for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
      if l:match("<timestamp>")
      or l:match("<title>")
      or l:match("<id>")
      then
        print(l)
      end
    end

Of course, you will probably need more sophisticated parsing than this.
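
As a rough sketch of what slightly more sophisticated parsing might look
like, the function below remembers the most recent <title> and reports every
<timestamp> that follows it, so each revision is attributed to its page.
The name each_revision and its callback are made up for illustration, and
the code still assumes one element per line. It takes any line iterator, so
nothing is kept in memory beyond the current line.

```lua
-- Call on_revision(title, timestamp) for every <timestamp> line,
-- pairing it with the most recently seen <title>.
-- Assumption: each element sits on a single line, as in the dump excerpt.
local function each_revision(lines, on_revision)
  local title
  for l in lines do
    local t = l:match("<title>(.-)</title>")
    if t then
      title = t
    else
      local ts = l:match("<timestamp>(.-)</timestamp>")
      if ts then
        on_revision(title, ts)
      end
    end
  end
end

-- Possible usage on a compressed dump, left commented out because it
-- depends on an external bzcat and on io.popen being available; the
-- ".bz2" file name is an assumption.
-- local f = assert(io.popen("bzcat enwiki-20140304-pages-meta-current1.xml-p000000010p000010000.bz2"))
-- each_revision(f:lines(), function(title, ts) print(title, ts) end)
-- f:close()
```

Reading through a pipe means the file is never decompressed to disk. Note
that titles come out as they appear in the XML, so entities such as &amp;
are not decoded.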