[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: NumericLua optimization trick: lazy copying
- From: Mike Pall <mikelu-0605@...>
- Date: Thu, 11 May 2006 04:11:07 +0200
Paul Chiusano wrote:
> >for q=1,1000 do
> > x = I + matrix.inv(x)
> Each matrix/vector has two parts: a 'core', and a 'pending' list - the
> core is the actual data structure itself; the 'pending' list is just a
> list of operations (lua functions) which must be applied to a copy of
> the core before it can perform any operation.
I've used lazy evaluation schemes in the past to avoid the
temporary object creation problem. But I did it the other way
round: keep track of an expression tree (or list) and give
returned objects just a short reference to it.
Only evaluate the expression when:
1. the tree (or list) grows too large,
2. a result outside of the domain is needed (e.g. a dot product
should return a number and not a one-element object) or
3. the last operation needs a copy anyway.
This works well for common linear expressions like Y = A*i + B*j
where the intermediate results are not needed. Only three very
short and constant size reference objects need to be created in
this case. These are easy to recycle by the memory allocator.
In the case of a vector/matrix library for Lua I suggest using
fixed size userdata objects. They can hold up to a fixed size
expression in a plain list (in postfix order). This avoids
problems with tree reference counting and should be good enough
for most practical purposes.
The only thing you need to figure out now, is how to evaluate an
expression with the minimum amount of copies and the best use of
the underlying domain operations. Extra bonus points if you JIT
compile the expression and the loops around it. :-)
In fact I did something like this (by hand) with the Lua entry
for the "pidigit" shootout benchmark:
(Ignore the highlighting -- it thinks #x is a comment.)
The particular PI calculation algorithm required by the specs
needs multi-precision numbers. Alas, Lua does not have such an
object type by default. The one external library used by the
original program was written in pure Lua and very slow. The other
bindings to C libraries had installation problems (the shootout
maintainers don't have endless time at their hands) or didn't
work well with Lua 5.1. Sigh.
So I wrote a self-standing implementation in pure Lua with a few
tricks. There are only a handful of linear expressions needed for
the calculation and they are dynamically pasted into the
surrounding loops and compiled on-the-fly.
Ok, in this case I hand-selected the expressions. But they could
have been derived automatically by a lazy evaluation approach.
Doing the full expressions inline is way better than a plain
mapping onto the usual +-*/ primitives. It avoids many object
copies and in turn reduces cache trashing which is a major
bottleneck in this benchmark.
The performance is quite competitive when compiled with LuaJIT.
Especially considering that practically all other entries use
dedicated MPN libraries written in C! And the winning entries all
link to GMP which uses hand-tuned assembler code:
So in most cases the performance of the language itself is not
measured, just the performance of the bignum binding.