- Subject: Re: Parsing big compressed XML files
- From: Petite Abeille <petite.abeille@...>
- Date: Thu, 3 Apr 2014 20:04:26 +0200
On Apr 3, 2014, at 12:01 PM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
> Currently we do it in Java, but it takes ages to parse all those files (around 18 hours).
18 hours?! Including download? Or?
> The parsing consists in extracting the timestamp of each of the versions of a page in the dump.
So, broadly speaking:
$ curl http://dumps.wikimedia.org/enwiki/20140203/enwiki-20140203-pages-articles-multistream.xml.bz2 | bzcat | xml2
The above file is around 10.6 GB compressed… the entire processing runs end-to-end in about 120 minutes on a rather pedestrian network connection, on a rather diminutive laptop… not quite apples to apples, but still, a far cry from 18 hours… my guess is there is room for improvement :)
(xml parsing courtesy of xml2: http://www.ofb.net/~egnor/xml2/ )
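For what it's worth, here is a minimal sketch of narrowing that pipeline down to just the revision timestamps. It assumes the dump nests them under /mediawiki/page/revision/timestamp and that xml2 prints that path as a slash-separated line followed by "=" and the text value; check the actual paths emitted for your dump before relying on it:

# NB: the element path below is an assumption about the dump schema,
# not something taken from the thread — adjust as needed.
$ curl http://dumps.wikimedia.org/enwiki/20140203/enwiki-20140203-pages-articles-multistream.xml.bz2 \
    | bzcat \
    | xml2 \
    | grep '^/mediawiki/page/revision/timestamp=' \
    | cut -d= -f2 > timestamps.txt

Everything stays streaming, so nothing ever has to be decompressed to disk or held in memory.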