Hi Mike,

thank you very much for what you have done. I've seen that, with
luajit2 from git HEAD, the code in vector form now performs much
better. Here are the results:

luajit2 git HEAD / array impl  / 0m0.296s
luajit2 git HEAD / unroll impl / 0m0.109s
luajit2 beta6    / array impl  / 0m10.860s
luajit2 beta6    / unroll impl / 0m0.109s
C (GSL) opt(*)   / array impl  / 0m0.206s

(*) CFLAGS="-O2 -march=native -mfpmath=sse"

The difference between git HEAD and 2.0-beta6 is huge (~100x);
compiled trace vs. interpreted code, I guess. Could you tell us more
about what you have done in LuaJIT2?
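
(By the way, for anyone who wants to reproduce these numbers: an easy
way to check whether the code is actually compiled or falls back to
the interpreter is to run the script in verbose JIT mode, e.g.
"luajit -jv rkf45vec-out.lua", which prints one line per compiled or
aborted trace; the NYI message quoted further below is that kind of
output.)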

2011/2/23 Mike Pall <mikelu-1102@mike.de>:
> Umm, is this the ancient NETLIB cblas code? You realize this is
> not tuned at all for modern CPUs? And it's not vectorized, so if
> you're using it and think you get a speedup, you're mistaken.
> Also, the DLL you provide, uses x87 code and not SSE ...

Yes, I'm aware of that, but for the moment the objective was to make
sure that LuaJIT2 actually compiles the code and we don't fall back to
interpreted mode. As everyone can see, this makes a huge difference.

Just to be clear, when I was talking about the vectorized form I meant
an implementation based on C arrays and cblas or similar, as opposed
to the implementation with templates where I unroll all the loops and
expand the array into local Lua variables.
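
To make the distinction concrete, here is a minimal sketch of the two
styles for a daxpy-like step y <- y + a*x (just an illustration, not
the actual code generated by the GSL Shell templates):

local ffi = require("ffi")

-- array form: x and y are FFI double arrays, looped over at runtime
local function daxpy_array(n, a, x, y)
  for i = 0, n - 1 do
    y[i] = y[i] + a * x[i]
  end
end

-- unrolled form for a fixed size (here n = 3): the vector is expanded
-- into plain local Lua variables, with no array indexing at all
local function daxpy_unrolled3(a, x1, x2, x3, y1, y2, y3)
  return y1 + a * x1, y2 + a * x2, y3 + a * x3
end

-- example usage of the array form
local x = ffi.new("double[?]", 3, {1, 2, 3})
local y = ffi.new("double[?]", 3, {4, 5, 6})
daxpy_array(3, 0.5, x, y)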

The difference between a plain FPU-based implementation and an
optimized SSE one is a further speedup that can be addressed later. I
see that, with luajit2 git HEAD, the difference between the vectorized
(cblas) and non-vectorized (explicit unroll with locals) code is a
factor of ~3. This is certainly non-negligible, but in this case the
problem comes from the cblas implementation that I picked up, and we
can replace it with a better implementation at any moment.

> Loops over vectors are certainly faster if written in plain Lua
> and compiled with LuaJIT (provided the vectors are not too short).

I agree, but we have seen that if the vectors are too short the trace
is not really optimized, so I've chosen an implementation based on an
FFI call to cblas to be sure that in any case we get reasonable
execution speed with both very small and very large vectors.
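
For reference, the binding amounts to a cdef plus ffi.load; a
simplified sketch along these lines (the library name "gslcblas" and
the sizes are placeholders, the real code lives in GSL Shell):

local ffi = require("ffi")

ffi.cdef[[
void cblas_daxpy(const int N, const double alpha,
                 const double *X, const int incX,
                 double *Y, const int incY);
]]

-- the library name depends on the platform and on how the cblas
-- library is built; "gslcblas" here is just an example
local cblas = ffi.load("gslcblas")

local n = 8
local x = ffi.new("double[?]", n)
local y = ffi.new("double[?]", n)
for i = 0, n - 1 do x[i], y[i] = i, 2 * i end

-- y <- y + 0.5 * x, with unit strides
cblas.cblas_daxpy(n, 0.5, x, 1, y, 1)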

>> I've given a look at the trace and it seems that the root of the
>> problem is the cblas function that LuaJIT2 doesn't like:
>>
>> [TRACE --- rkf45vec-out.lua:78 -- NYI: unsupported C function type at
>> rkf45vec-out.lua:83]
>>
>> the function incriminated is cblas_daxpy. But I don't really know.
>
> My fault. Just released a fix for this to git HEAD. Much faster
> now.

Well, I would not say that it was your fault. I guess we are testing
LuaJIT2 with more elaborate and complex algorithms and, as LuaJIT2 is
very young, we are discovering some problems. I believe this work will
help make LuaJIT2 even more reliable and fast in real-world
applications.

What we are doing shouldn't be underestimated. We are implementing
numerical algorithms in Lua and we obtain performance comparable to,
or better than, C or Fortran. I think this is a major advance in
computer science because, as far as I know, this is the first time
that an interpreted programming language like Lua can deliver this
kind of execution speed. Seen from this perspective, it is certainly
not surprising that, as we implement more and more complex algorithms,
we discover some minor problems with the JIT.

I believe that we are working on something great, so I'm highly
motivated, and I would like to thank you for the fantastic work that
you are doing. I really appreciate your feedback on the implementation
of numerical algorithms and I believe that both LuaJIT and GSL Shell
will benefit from this work.

Best regards,
Francesco