- Subject: Re: Parsing big compressed XML files
- From: Luiz Henrique de Figueiredo <lhf@...>
- Date: Thu, 3 Apr 2014 09:28:27 -0300
> I was wondering if you are aware of an efficient way to parse a big
> compressed XML file with Lua (or LuaJIT).
> The files store the wikipedia dumps:
> http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name
> pages-meta-history).
> The parsing consists of extracting the timestamp of each of the
> versions for a page in the dump.
Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.
for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
  if l:match("<timestamp>")
  or l:match("<title>")
  or l:match("<id>")
  then
    print(l)
  end
end
Of course, you will probably need more sophisticated parsing.
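
As a rough sketch of what slightly more sophisticated parsing could look like, the loop can remember the most recent <title> and hand each <timestamp> to a callback, so nothing has to be kept in memory. The name each_timestamp is made up for illustration, and the code still assumes that every <title> and <timestamp> element sits on a single line. The commented usage assumes a bzip2-compressed dump, a bzcat program on the PATH, and a platform where io.popen is available; the file name there is only a placeholder.

```lua
-- Hypothetical helper: calls callback(title, timestamp) for every
-- <timestamp> line, using the most recently seen <title>.
-- Assumes one element per line, as in the loop above.
local function each_timestamp(lines, callback)
  local title = nil
  for l in lines do
    local t = l:match("<title>(.-)</title>")
    if t then
      title = t                      -- a new page starts here
    else
      local ts = l:match("<timestamp>(.-)</timestamp>")
      if ts then
        callback(title, ts)          -- one call per revision
      end
    end
  end
end

-- Possible usage on a compressed dump (placeholder file name):
--   local f = assert(io.popen("bzcat dump.xml.bz2"))
--   each_timestamp(f:lines(), function(title, ts) print(title, ts) end)
--   f:close()
```

Reading through a pipe means the decompressed file never has to be written to disk; any decompressor that writes to standard output could be substituted for bzcat.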