Re: Parsing big compressed XML files

Subject: Re: Parsing big compressed XML files
From: Valerio Schiavoni <valerio.schiavoni@ ... >
Date: Thu, 3 Apr 2014 16:55:18 +0200

Hello Luiz,

Thanks for your reply. However, your solution seems to suggest that I first need to decompress the (potentially huge) compressed file beforehand.

Is that the case? Do you think it is possible to parse the compressed file directly, or by piping its content through zcat, for instance?
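
To make the question concrete, here is a rough sketch of the piping idea. It assumes that io.popen is supported on the platform, that a POSIX shell is available, and that zcat (or whichever decompressor matches the file format) is on the PATH. The name compressed_lines and its tool parameter are made up for illustration.

```lua
-- Hypothetical helper: iterate over the lines of a compressed file by
-- reading the output of an external decompressor through a pipe, so the
-- decompressed data is never written to disk.
-- Assumptions: io.popen is available and `tool` (default "zcat") is on the PATH.
local function compressed_lines(path, tool)
  tool = tool or "zcat"
  -- Quote the path for a POSIX shell (single quotes, with embedded quotes escaped).
  local quoted = "'" .. path:gsub("'", "'\\''") .. "'"
  local pipe = assert(io.popen(tool .. " " .. quoted, "r"))
  local next_line = pipe:lines()
  return function()
    local line = next_line()
    if line == nil then
      pipe:close() -- end of stream: release the pipe
    end
    return line
  end
end

-- Possible use (the file name here is a placeholder):
--   for l in compressed_lines("dump.xml.gz") do
--     if l:match("<timestamp>") then print(l) end
--   end
```

Only one line is held in memory at a time, so the size of the decompressed data should not matter much; a different decompressor can be passed as the second argument.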

Best,

valerio

On Thu, Apr 3, 2014 at 2:28 PM, Luiz Henrique de Figueiredo <lhf@tecgraf.puc-rio.br> wrote:

> I was wondering if you are aware of an efficient way to parse a big
> compressed XML file with Lua (or LuaJIT).
> The files store the Wikipedia dumps:
> http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name
> pages-meta-history).

> The parsing consists of extracting the timestamp of each of the
> versions of a page in the dump.

Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.

for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
  if l:match("<timestamp>")
  or l:match("<title>")
  or l:match("<id>")
  then
    print(l)
  end
end

Of course, you will probably need to do more sophisticated parsing.
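
As a sketch of what slightly more sophisticated parsing could look like, the loop can capture the text inside the tags instead of printing whole lines, and remember the most recent page title so that each timestamp can be attributed to its page. This still assumes one field per line, with the opening and closing tag on the same line. The names each_timestamp and callback are hypothetical.

```lua
-- Hypothetical helper: call callback(title, timestamp) for every
-- <timestamp> line, paired with the most recently seen <title>.
-- `lines` is any line iterator, e.g. the result of io.lines(filename).
local function each_timestamp(lines, callback)
  local title -- title of the page currently being read (nil before the first page)
  for l in lines do
    local t = l:match("<title>(.-)</title>")
    if t then
      title = t
    else
      local ts = l:match("<timestamp>(.-)</timestamp>")
      if ts then
        callback(title, ts)
      end
    end
  end
end

-- Possible use: print "title<TAB>timestamp" for every revision.
--   each_timestamp(io.lines(filename), print)
```

Passing a callback rather than collecting results in a table keeps memory use flat, which matters for files of this size.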