On Apr 3, 2014, at 12:01 PM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:

> Currently we do it in Java, but it takes ages to parse all those files (around 18 hours).

18 hours?! Including download, or just the parsing?

> The parsing consists of extracting the timestamp of each of the versions of a page in the dump.

So, broadly speaking: 

$ curl http://dumps.wikimedia.org/enwiki/20140203/enwiki-20140203-pages-articles-multistream.xml.bz2 | bzcat | xml2

The above file is around 10.6 GB compressed… the entire processing runs end-to-end in about 120 minutes on a rather pedestrian network connection, on a rather diminutive laptop… not quite apples to apples, but still, a far cry from 18 hours… my guess is there is room for improvement :)

(xml parsing courtesy of xml2: http://www.ofb.net/~egnor/xml2/ )
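
If all you want is the timestamps, a small Lua filter tacked onto the end of that pipe would do the extraction. A rough sketch, assuming xml2 emits the timestamps on lines ending in something like .../revision/timestamp=2014-02-03T12:34:56Z (the exact path is an assumption, not checked against the dump schema):

  -- timestamps.lua: read xml2 output on stdin, print one revision timestamp per line
  -- NOTE: the "/revision/timestamp=" path is an assumption about xml2's
  -- flattened output for the enwiki dump, not verified.
  local count = 0
  for line in io.lines() do
    local ts = line:match("/revision/timestamp=(.+)$")
    if ts then
      count = count + 1
      io.write(ts, "\n")
    end
  end
  io.stderr:write(("extracted %d timestamps\n"):format(count))

i.e. something along the lines of:

  $ curl <dump-url> | bzcat | xml2 | lua timestamps.lua > timestamps.txt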