lua-l archive



On 6/26/2012 1:03 AM, Enrico Colombini wrote:
> On 25/06/2012 18.46, KHMan wrote:
>> It probably arrived at similar sizes due to different mechanisms. File header overhead (and deflate method overheads) is also significant for small files. I meant that it should not be any more compressible than an equivalent string of digits output by a strong encryption method.
>>
>> sieve.number is compressed mainly by reducing the length of codewords to under 4 bits per symbol due to 10 symbols (digits), while the stream of digits itself has no real pattern and cannot be compressed via LZ coding.
>
> I looked at the sizes inside a single zip file to reduce header overload, but I agree it's probably just a nice coincidence. It would be interesting to confirm this by repeating the test on large files.

Interestingly enough, my quick assessment of the compression behaviour above was seriously wrong... Your 387-byte result for Infozip indicates that LZ did some work, since Huffman coding alone would have taken close to 4 bits/symbol, and Infozip also needed to store the encoded Huffman table.
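
To put a rough number on the Huffman-only cost, here is a minimal Python sketch. It assumes the ten digits are equally likely, which is an idealisation and not something measured from sieve.number; the function name `huffman_code_lengths` is made up for illustration. It builds a Huffman tree over ten equal weights and compares the average code length with the entropy bound of log2(10) bits per digit.

```python
import heapq
import math

def huffman_code_lengths(weights):
    """Return the Huffman code length (in bits) for each symbol weight."""
    # Heap entries: (subtree weight, unique tie-breaker, symbols in subtree).
    # The unique tie-breaker keeps the symbol lists from ever being compared.
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    counter = len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths

# Assumption: all ten digits occur equally often.
lengths = huffman_code_lengths([1] * 10)
avg_bits = sum(lengths) / len(lengths)
entropy_bits = math.log2(10)

print("code lengths:", sorted(lengths))
print("average bits/digit:", avg_bits)
print("entropy bound bits/digit:", round(entropy_bits, 3))
```

Under that assumption six digits get 3-bit codes and four get 4-bit codes, so the average is 3.4 bits per digit before the cost of storing the table is counted. Any output much smaller than that estimate suggests the LZ stage found matches.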

It turned out that having only 10 symbols allowed quite a lot of repeated patterns, much like the "birthday problem" in statistics. I have an instrumented liblzf, and it reported something like 119 matches of 3 bytes, and some 4-byte matches too. Not too shabby.
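
The birthday-style effect is easy to reproduce with a small Python sketch. Everything here is an assumption for illustration: the input is a pseudo-random digit string with an arbitrary length and seed, not sieve.number, and `count_repeated_trigrams` is a hypothetical helper, not part of liblzf. It only counts 3-digit sequences that have appeared earlier in the string, which is cruder than what a real LZ matcher does (windowed, hash-based, greedy), so the counts will not line up with the liblzf figures.

```python
import random

def count_repeated_trigrams(digits):
    """Count positions whose 3-character sequence already appeared earlier."""
    seen = set()
    repeats = 0
    for i in range(len(digits) - 2):
        tri = digits[i:i + 3]
        if tri in seen:
            repeats += 1
        else:
            seen.add(tri)
    return repeats

# Assumptions: arbitrary fixed seed and a hypothetical length of 1200 digits.
rng = random.Random(2012)
sample = "".join(rng.choice("0123456789") for _ in range(1200))

print("repeated 3-digit sequences:", count_repeated_trigrams(sample))
```

There are only 1000 distinct 3-digit sequences, so even a patternless digit stream runs into repeats quickly. At 1200 digits there are 1198 positions, and the pigeonhole principle alone forces at least 198 of them to be repeats.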

Goes to show that analysis-at-a-quick-glance does not work all the time... my bad. :-)

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia