On 04/03/2014 06:01 PM, Valerio Schiavoni wrote:
Hello,
I was wondering if you are aware of an efficient way to parse a big compressed XML file with Lua (or LuaJIT).
The files store the wikipedia dumps: http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name pages-meta-history).
Currently we do it in Java, but it takes ages to parse all those files (around 18 hours).
The parsing consists of extracting the timestamp of each version of a page in the dump.

Suggestions?

Thanks,
Valerio

Could you please give some more details or some examples of the XML schema you want to parse?

As a general suggestion, you should use `lzlib' to read the compressed data, together with a
SAX-style XML parser (such as lxp or XLAXML), to deal with _HUGE_ XML input files.
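Untested sketch of that combination, assuming a gzip-compressed dump named dump.xml.gz, lzlib's file-like read() on the inflate stream, and the MediaWiki export schema (each <revision> carries a <timestamp> element):

local zlib = require("zlib")   -- lzlib
local lxp  = require("lxp")    -- LuaExpat

local buf, timestamps = nil, {}

local parser = lxp.new{
  StartElement = function(p, name)
    if name == "timestamp" then buf = {} end
  end,
  CharacterData = function(p, text)
    if buf then buf[#buf + 1] = text end   -- may fire more than once per element
  end,
  EndElement = function(p, name)
    if name == "timestamp" then
      timestamps[#timestamps + 1] = table.concat(buf)
      buf = nil
    end
  end,
}

local stream = zlib.inflate(assert(io.open("dump.xml.gz", "rb")))
while true do
  local chunk = stream:read(2^16)   -- 64 KiB at a time
  if not chunk then break end
  assert(parser:parse(chunk))
end
assert(parser:parse())              -- signal end of document
parser:close()
stream:close()

print(#timestamps .. " timestamps extracted")

Note the files you linked are .7z/.bz2 rather than gzip, so you would have to swap in a matching decompression library, or pipe through 7za/bzcat with io.popen and skip lzlib entirely.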

But if your schema is _simple enough_ and the XML file is guaranteed to be _well-formed_,
you might not need a fully functional XML parser (even a SAX-style one) at all.
In such cases, some simple, carefully written pattern matching might be enough.
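For instance, something along these lines (untested; it assumes each <timestamp> element sits on its own line, which holds for the current dumps but is not guaranteed by XML, and it assumes lzlib's file-like lines() iterator):

local zlib = require("zlib")

local stream = zlib.inflate(assert(io.open("dump.xml.gz", "rb")))
local count = 0
for line in stream:lines() do
  -- grab the element content; values look like "2014-03-04T18:01:00Z"
  local ts = line:match("<timestamp>(.-)</timestamp>")
  if ts then
    count = count + 1
  end
end
stream:close()
print(count .. " timestamps matched")

This avoids all parser overhead, which should matter a lot at the sizes you mention.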