[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: Wanted: Lua source code cruncher
- From: KHMan <keinhong@...>
- Date: Tue, 11 Mar 2008 01:59:15 +0800
David Given wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> KHMan wrote:
> I've investigated gzipping the relevant bits, and indeed pm comes with
> an executable that does this. The savings are huge --- the executable
> goes from about 300kB to 70kB --- but oddly enough that's actually
> rather counterproductive. Since this is a build tool, it's normally put
> in build trees that are distributed as tar.gz or tar.bz2 files; and
> compressing the entire tar file works better if pm is uncompressed. That
> is, compressing the C and Lua code in the executable causes the
> resulting tar.bz2 to be *larger* than if it were uncompressed. I did
> some measurements:
The results are pretty much expected. Compression thrives on
redundancy or repeated patterns. We're seeing different effects
here. In the size list:
Version None gzip -9 bzip -9
pm_uncompressed 285kB 81kB 71kB
pm_7bit 112kB 81kB 82kB
pm_8bit 82kB 85kB 84kB
pm_uncompressed has the most repeating patterns, hence it
compresses best. bzip2 is quite a bit better than gzip, moreso for
large data blocks due to BWT block sorting.
pm_8bit can't be compressed a second time. The shell script bit at
the beginning can still be compressed a bit, but the 8-bit
compressed data -- random patterns -- can no longer be compressed
again. Overall, gain will tend to zero. bzip's BWT step only works
with repeated patterns. Again, zero gain, more or less.
pm_7bit still compresses a bit, but not because of the LZ step in
gzip or the BWT step in bzip2. Both LZ and BWT will fail to do
anything because there is hardly any repeated patterns. The
savings is due to Huffman coding dealing with the 6-bit symbol
coverage of the uuencoded data. Lop 2 bits off and 75%*112 gives
84, as expected.
So for putting pm in archive releases and compressing it, the one
with the most redundant patterns is best for compression, i.e.
pm_uncompressed. Solid archives (tarred) also improves the
potential for matching and encoding repeated patterns, hence a
slightly better compression ratio. LZMA (7-zip) is usually better
than bzip2, but I think pervasive adoption will take a few more years.
> I've investigated luac, but not only is it non-portable but it doesn't
> reduce the output nearly as much as one would expect. (For pm, it
> reduces the source down to 61% --- not as much as LuaSrcDiet does.) It
> *would*, however, allow me to omit the compiler from the interpreter
> source... if it weren't for the fact that the Lua compiler is a key
> feature of pm. Maybe for another project.
Adding local renaming to LuaSrcDiet is interesting, and I have
pretty low priority plans to update it for 5.1 and have a TODO
list with optimizations to foolhardy extremes (local renaming,
string reformatting, number reformatting, etc.) Luiz also has
lstrip, which does pretty much the same thing.
When considering compressed sources, long local names is a form of
redundancy, and compressors are pretty good at eliminating this
redundancy. Local renaming can still save quite a few bytes when
considering the size of uncompressed Lua source code though.
I mentioned the problems with compressing binary chunks in an
earlier posting. When there is size info like: 05 00 00 00 ... 06
00 00 00, the best is a match of 3, 00 00 00, the minimum match
length for deflate in gzip. Not too great for compressing. VM
instructions fields are also not aligned to 8 bit boundaries,
making the performance of Huffman encoding worse. Source code has
more repeatable patterns and a nicer symbol distribution.
Anyway, I think 285K is much less scary than 1 to 2MB configure
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia