Chris Marrin wrote:
> On Mar 9, 2008, at 8:20 AM, David Given wrote:
> [snip]
>>>> With C, I can drastically reduce the size of the source code by using a
>>>> cruncher utility: this reads in the C source and emits an equivalent but
>>>> smaller source file by removing whitespace, comments, renaming locals to
>>>> smaller versions, etc.
>>> My lstrip does all that, except rename locals; but it's in C.
>> [snip]
> I'm not sure of the nature of your app, but maybe it would be easier to
> just compress your Lua source using gzip and then use gzio or something
> to uncompress it? Gzip does a great job of compressing things like long
> variable names which repeat throughout the text. You may still want to
> get rid of comments to get maximum compression, though.

For source code, removing comments and whitespace is a pretty big
win. Renaming locals on top of that will help less, and AFAIK
nobody's written a tool to do that yet. I think it is a limited
'win'. To illustrate, say we have two versions:

A:	i = i + 1
	i = i + 1

B:	foobar = foobar + 1
	foobar = foobar + 1

B is longer by 20 bytes uncompressed. With an LZ-based compressor,
in version A the first 'i' will be one literal code, the second 'i'
will be one literal code, and the whole of line 2 will be one match
code. In version B, the first 'foobar' will be six literals, the
second 'foobar ' will be one match code, and line 2 will still be
one match code. Roughly, compressed B is longer than compressed A
by 4 literals and 1 match, say about 6 bytes in total, and less
once the literals themselves are entropy-coded. I suspect the
overall effect of renaming locals for compressed source code is
just a few percent.
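
The estimate above can be checked directly. This is a minimal sketch
using Python's standard zlib module; the helper name deflated_size is
made up for illustration, and the exact byte counts depend on the
zlib version and settings, so only the differences are printed.

```python
import zlib

# The two versions from the example above, one statement per line.
version_a = b"i = i + 1\ni = i + 1\n"
version_b = b"foobar = foobar + 1\nfoobar = foobar + 1\n"


def deflated_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at level 9."""
    return len(zlib.compress(data, 9))


# How much longer B is than A, before and after compression.
raw_diff = len(version_b) - len(version_a)
packed_diff = deflated_size(version_b) - deflated_size(version_a)

print("uncompressed difference:", raw_diff, "bytes")
print("compressed difference:  ", packed_diff, "bytes")
```

On inputs this small the fixed zlib header and trailer dominate the
absolute sizes, which is why the comparison is between differences
rather than ratios.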

zlib is great for general-purpose compression, as Ico has pointed
out. If you want an extra few percent of savings, a solid archive
(files concatenated, then compressed as a single stream) will do
much better than compressing each file separately. Solid bzip2 or
lzma will give even better results. The effect of renaming locals
is likely to be smaller than, say, the improvement from solid
archiving or from switching to bzip2. At that point renaming is not
a real necessity and becomes more of a benchmarking exercise.
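
To make the solid-archive point concrete, here is a rough sketch that
compresses a handful of similar files separately and then as one
concatenated stream. The file contents and the function name
separate_vs_solid are invented for illustration; real sources will
give different numbers.

```python
import zlib


def separate_vs_solid(files, compress=zlib.compress):
    """Return (total size compressed one by one, size compressed as one stream)."""
    separate = sum(len(compress(f)) for f in files)
    solid = len(compress(b"".join(files)))
    return separate, solid


# Hypothetical stand-ins for a set of small, similar source files.
files = [
    ("local function handler_%d(request)\n"
     "  local result = process(request, %d)\n"
     "  return result\n"
     "end\n" % (n, n)).encode("ascii")
    for n in range(8)
]

separate, solid = separate_vs_solid(files)
print("separate:", separate, "bytes; solid:", solid, "bytes")
```

The standard bz2.compress and lzma.compress functions can be passed as
the compress argument to try the other compressors, though on inputs
this tiny their container overhead may outweigh any gain.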

For binary chunks, all the size_t and integer fields will give a
lot of 00 00 00 octets, so an LZ-based compressor might end up with
shorter average match lengths and more matches, thus possibly
poorer compression. VM instructions, because their bit fields do
not line up with octet boundaries, also benefit less from
compression of their constituent octets.
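
As a small illustration of where the zero octets come from, the sketch
below packs a few small values as fixed-width integers. The values are
made up, and the 4-byte little-endian layout is an assumption about
the platform, not something taken from the chunk format itself.

```python
import struct

# Hypothetical small counts and sizes of the kind stored in a binary chunk.
values = [1, 2, 5, 12, 40, 200]

# Assumption: each value is stored as a 4-byte little-endian integer.
packed = b"".join(struct.pack("<I", v) for v in values)

# Fraction of the packed octets that are zero.
zero_fraction = packed.count(0) / len(packed)

print(packed.hex(" "))
print("zero octets: %.0f%%" % (zero_fraction * 100))
```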

So all of this becomes very subjective and very much dependent on
the characteristics of particular sources... For some old Yueliang
code, I did some testing long ago, and the zip compression ratios
were as follows:
                     Size   Zipped  Ratio
                  (bytes)  (bytes)  (zipped/size)
Original sources   130162    29546  22.7%
with LuaSrcDiet     48308    13841  28.7%
luac               108174    32238  29.8%
luac -s             64930    21867  33.7%
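
The ratio column alone can mislead, because the denser inputs compress
less well yet still end up smaller. The short sketch below recomputes
the ratios from the figures in the table and also expresses each
zipped size relative to the original source size; the helper name
percent is just for illustration.

```python
# (name, size in bytes, zipped size in bytes), copied from the table above.
rows = [
    ("Original sources", 130162, 29546),
    ("with LuaSrcDiet", 48308, 13841),
    ("luac", 108174, 32238),
    ("luac -s", 64930, 21867),
]
original_size = rows[0][1]


def percent(part, whole):
    """`part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


for name, size, zipped in rows:
    print("%-18s ratio %5.1f%%   zipped vs original source %5.1f%%"
          % (name, percent(zipped, size), percent(zipped, original_size)))
```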

One can also tokenize the source like old BASIC interpreters, but
again, that is more of an academic exercise, since we can easily
get as good or better compression by choosing a better compressor.
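
For the curious, keyword tokenizing might look something like the
sketch below. It is a made-up scheme, not taken from any existing
tool: each Lua 5.1 reserved word is replaced by a single byte from
0x80 upward, which assumes the source is pure ASCII so those byte
values are free.

```python
import re

# Lua 5.1 reserved words; each gets a one-byte token starting at 0x80.
KEYWORDS = [b"and", b"break", b"do", b"else", b"elseif", b"end", b"false",
            b"for", b"function", b"if", b"in", b"local", b"nil", b"not",
            b"or", b"repeat", b"return", b"then", b"true", b"until", b"while"]
TOKEN_OF = {kw: bytes([0x80 + i]) for i, kw in enumerate(KEYWORDS)}
KEYWORD_RE = re.compile(rb"\b(?:" + b"|".join(KEYWORDS) + rb")\b")


def tokenize(source: bytes) -> bytes:
    """Replace whole-word keywords with one-byte tokens."""
    # Assumption: ASCII-only input, so bytes >= 0x80 cannot clash with tokens.
    if any(b >= 0x80 for b in source):
        raise ValueError("source must be ASCII")
    return KEYWORD_RE.sub(lambda m: TOKEN_OF[m.group(0)], source)


def detokenize(data: bytes) -> bytes:
    """Expand one-byte tokens back into keywords."""
    return b"".join(KEYWORDS[b - 0x80] if b >= 0x80 else bytes([b])
                    for b in data)


source = b"local function f(x)\n  if x then return true end\nend\n"
packed = tokenize(source)
print(len(source), "->", len(packed), "bytes")
```

Keywords that happen to appear inside strings or comments are
tokenized too, but since detokenize expands every token back to the
same word, the round trip is still lossless.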

Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia