- Subject: Re: Parsing big compressed XML files
- From: Petite Abeille <petite.abeille@...>
- Date: Thu, 3 Apr 2014 20:04:26 +0200
On Apr 3, 2014, at 12:01 PM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
> Currently we do it in Java, but it takes ages to parse all those files (around 18 hours).
18 hours?! Including download? Or?
> The parsing consists in extracting the timestamp of each of the versions of a page in the dump.
So, broadly speaking:
$ curl http://dumps.wikimedia.org/enwiki/20140203/enwiki-20140203-pages-articles-multistream.xml.bz2 | bzcat | xml2
The above file is around 10.6 GB compressed… the entire processing runs end-to-end in about 120 minutes on a rather pedestrian network connection, on a rather diminutive laptop… not quite apples to apples, but still, a far cry from 18 hours… my guess is there is room for improvement :)
(xml parsing courtesy of xml2: http://www.ofb.net/~egnor/xml2/ )
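For what it's worth, here is a minimal sketch of narrowing that pipeline down to just the revision timestamps. It assumes the dump nests them under /mediawiki/page/revision/timestamp and that xml2 prints that path as a slash-separated line followed by "=" and the text value; check the actual paths emitted for your dump before relying on it:

# NB: the element path below is an assumption about the dump schema,
# not something taken from the thread — adjust as needed.
$ curl http://dumps.wikimedia.org/enwiki/20140203/enwiki-20140203-pages-articles-multistream.xml.bz2 \
    | bzcat \
    | xml2 \
    | grep '^/mediawiki/page/revision/timestamp=' \
    | cut -d= -f2 > timestamps.txt

Everything stays streaming, so nothing ever has to be decompressed to disk or held in memory.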