

Hello Luiz,
thanks for your reply. But your solution seems to suggest that I first need to decompress the (potentially huge) compressed file beforehand.
Is that the case? Do you think it is possible to work directly on the compressed file, or to pipe its content through zcat, for instance?
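
Something like the following is what I have in mind (untested, and the file name is just a placeholder; bzcat would be needed for the .bz2 dumps):

-- read the decompressor's output line by line through a pipe
local f = assert(io.popen("zcat dump.xml.gz", "r"))  -- placeholder file name
for l in f:lines() do
        if l:match("<timestamp>") then
                print(l)
        end
end
f:close()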

Best,
valerio


On Thu, Apr 3, 2014 at 2:28 PM, Luiz Henrique de Figueiredo <lhf@tecgraf.puc-rio.br> wrote:
> I was wondering if you are aware of an efficient way to parse a big
> compressed XML file with Lua (or LuaJIT).
> The files store the wikipedia dumps:
> http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name
> pages-meta-history).
> The parsing consists of extracting the timestamp of each of the
> versions of a page in the dump.

Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.

-- print the lines that carry the fields of interest; each field
-- appears on a line of its own in the dump
for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
        if l:match("<timestamp>")
        or l:match("<title>")
        or l:match("<id>")
        then
                print(l)
        end
end

Of course, you will probably need to do more sophisticated parsing.
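For instance, a capture pattern would extract just the timestamp values instead of printing the whole lines (a sketch, assuming each field always fits on a single line):

for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
        -- capture the text between the tags, e.g. 2014-03-04T12:34:56Z
        local t = l:match("<timestamp>(.-)</timestamp>")
        if t then print(t) end
end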