On 4/7/2014 7:03 AM, Valerio Schiavoni wrote:
And for your curiosity, on one of the smaller files, I get significant differences between 7z and bz2:

$ time 7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z 2>/dev/null > /dev/null
7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z  6.10s user 0.02s system 99% cpu 6.120 total

$ time bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null
bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null  61.26s user 0.14s system 99% cpu 1:01.41 total
It's strange that Wikipedia has not moved to xz and is still using a mix of 7z and bzip2; even the Linux kernel has moved to tar.xz. Both 7z and xz use the newer LZMA2.
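If you want to try LZMA2 with both tools yourself, something like the following should work with stock xz and p7zip (dump.xml is a placeholder filename):

$ xz -k -9 dump.xml                  # xz uses the LZMA2 filter by default; writes dump.xml.xz
$ 7z a -m0=lzma2 dump.7z dump.xml    # p7zip defaults to LZMA for .7z; -m0=lzma2 selects LZMA2 explicitly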
Unfortunately, BWT+Huffman in bzip2 has roughly symmetrical times for compression and decompression. 7z/xz decompression is not symmetrical with compression, and at big block/dictionary sizes it will always be a lot faster than bzip2. If you are making multiple passes over the data, recompressing the whole kaboodle to 7z/xz will probably greatly improve your runtimes. Things like LZO are a lot faster still, but will likely only compress text data to around 50% of its original size.
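As a rough sketch of that one-time recompression (again with placeholder filenames; xz reads stdin and writes stdout when used in a pipe):

$ bzcat dump.xml.bz2 | xz -9 > dump.xml.xz   # pay the slow compression once
$ time xzcat dump.xml.xz > /dev/null         # every later pass decompresses far faster than bzcat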
--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia