[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: LuaJIT2 performance for number crunching
- From: Francesco Abbate <francesco.bbt@...>
- Date: Sun, 13 Feb 2011 18:22:45 +0100
2011/2/13 Mike Pall <email@example.com>:
> Well, that's the main problem. LuaJIT is not tuned to deal with
> tons of loops that run only 2 iterations. It unrolls them, but
> there's a limit to that and this hits here.
thank you very much for your answer. Now everything is much more clear
for me. I've got the feeling that something in the code was preventing
LuaJIT from performing the needed optimizations and the results with
FFI was especially surprising to me but now everything is clear.
I have already understood in past that in order to let LuaJIT optimize
effectively the code you have to avoid use of closures and adopt a
more procedural style. Now I understand better that LuaJIT only
optimize the innermost loops and this explain why it does largely fail
to optimize rk4 code.
> In C++ one would use templates for that purpose. This instantiates
> a copy of the whole code for a specific number of dimensons.
> That's not as wasteful as it sounds, since you probably only ever
> use a finite set of dimensions, e.g. dim=2 and dim=3
Well, for me there is some confusion here. It is true that you can use
templates with the dimension of the system as a parameter to let the
C++ compiler more aggressive optimization but if you use just plain C
loops with a non-constant bound the code will be still optimized and
the difference of performance will be a minor one. I mean, optimizing
C/C++ compiler always optimize *all* the code. The problem here is
that LuaJIT2 will only optimize the innermost loops and do almost
nothing to optimize the code in the outer loop.
> In Lua we can do the same: specialize the Lua code at runtime for
> the number of dimensions, memoize the code in a table and dispatch
> based on the dimension. I.e. string.format + loadstring +
> memoization table.
Yes, you can do that and this is an interesting possibility but it
does also increase significantly the level of complexity. You cannot
just implement your algorithm but you have to generate at run-time the
code for your routine in a way that LuaJIT can optimize. This can be
done in same cases but I think that it cannot be proposed as a generic
method to implement algorithms.
>> Here the results of the benchmark:
>> LuaJIT2 with FFI, 0m47.805s
>> LuaJIT2 with GSL, 1m20.958s
> The tiny Lua function is never compiled. It runs interpreted and
> the overhead for the FFI is big when not compiled. But apparently
> still slightly less than the overhead of C code with userdatas.
Unfortunately the C code have to check if the index is within the
limits and it does have to check the type of the userdata. This latter
is a very expensive operations because of the way it is normally done
in Lua as described in PIL.
>> LuaJIT2 with Lua tables, 0m9.319s
>> Lua 5.1.4, 0m26.644s
>> C code with GSL, (compiled with -O2): 0m0.607s
> You're lucky it runs even that fast. None of the Lua code is
> compiled, due to all of those tiny loops! Have a look with -jv.
> Just for the sake of the experiment, I've expanded the code for
> two dimensions by hand (attached below). This is still problematic
> due to all of these loads and stores to y, y etc., where
> local variables would do. For the real thing you should definitely
> expand this to y_1, y_2 etc..
> Since this is really just one giant loop, all those stores and
> guards cause too many snapshots, so you'll need to run it with
> -Omaxsnap=200 for the plain Lua array. The FFI causes less guards,
> so it runs with the default settings.
> 5.68s rk4.lua Lua tables
> 5.68s rk4.lua FFI
> 0.24s rk4_2d.lua Lua tables with -Omaxsnap=200
> 0.23s rk4_2d.lua FFI
> Now this is ~25x faster than before, which also means it's faster
> than the pure C code!
> And if you'd use local variables instead of all those tiny arrays,
> performance would be even better.
Ok, now it is clear. I see that when you unroll the innermost loops
the code is *really* optimize and very close (or may be better) than C
code. Unfortunately, as you know, the RK4 algorithm cannot be
implemented like you have done because it will work only of ODE system
of dimension 2. The alternative is, as you suggested, to generate
dynamically the code and let the JIT compiler optimize the code. This
is definitely interesting but my feeling is that we cannot propose
that as a generic method to implement algorithms.
> But this actually does prove my point: rewriting mixed C/Lua code
> in pure Lua (+FFI) is the way to go.
Well, this is what I've done but it was not working!!! :-)
...so you should say: rewriting mixed C/Lua code in pure Lua (+FFI)
and program everything in procedural way using only a single big loop
to be sure that LuaJIT optimize it :-)
Seriously, this discussion was very important for me to better
understand the strength and the limitations of LuaJIT2. For me the
problem with nested loops is a showstopper to propose Lua for
numerical algorithm implementation. As you explained, if you use some
special techniques it can work and you can obtain excellent
performance but the price is a higher level of complexity and much
less expressive code. I can also imagine a case where the dimension of
the problem is quite big, let say 100, what about generating ~ 600
local variables and unroll a huge loop to let LuaJIT optimize it ? As
everyone will understand this is really awkward and not robust for
generic algorithm implementations.
In my point of view the potential of the LuaJIT compiler will be much
greater for numerical algorithms and also non-numerical ones when it
will be able to optimize at least two ot three levels of nested loops.
To be honest I don't have any idea about how difficult it would be to
be done but from a user point of view this is what would be needed.
Just to be clear, I don't want to diminish the outstanding technical
value of LuaJIT2 and of its FFI implementation. We all know that what
Mike have done till know is of exceptional quality and if I have made
the benchmark was just to better understand its potential for
numerical algorithm implementation.