- Subject: Re: LuaJIT2 performance for number crunching
- From: Mike Pall <mikelu-1102@...>
- Date: Wed, 23 Feb 2011 00:31:37 +0100
Francesco Abbate wrote:
> I guess that this problem
> can be easily solved by loading the cblas library but I can give more
> help if needed.
Umm, is this the ancient NETLIB cblas code? You realize this is
not tuned at all for modern CPUs? And it's not vectorized, so if
you're using it and think you get a speedup, you're mistaken.
Also, the DLL you provide uses x87 code and not SSE ...
Loops over vectors are certainly faster if written in plain Lua
and compiled with LuaJIT (provided the vectors are not too short).
[I can understand the desire to avoid rewriting all of cblas in
Lua, but a daxpy loop seems easy. And BTW: do NOT unroll it by
hand, this is counter-productive on modern CPUs.]
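For illustration, here is a minimal sketch of such a plain-Lua daxpy loop (y := alpha*x + y). It assumes the vectors are 1-based Lua tables of equal length n; the function name `daxpy` is hypothetical and simply mirrors the BLAS operation. With LuaJIT one could equally pass FFI arrays created with `ffi.new("double[?]", n)`, adjusting the loop to run from 0 to n-1. As advised above, the loop is left in its simple form and not unrolled.

```lua
-- Hypothetical plain-Lua replacement for cblas_daxpy: y := alpha*x + y.
-- Assumes x and y are 1-based Lua tables holding at least n numbers.
-- The loop is deliberately kept simple and NOT unrolled by hand.
local function daxpy(n, alpha, x, y)
  for i = 1, n do
    y[i] = y[i] + alpha * x[i]
  end
  return y
end

-- Usage example: y becomes {12, 24, 36}
local x = {1, 2, 3}
local y = {10, 20, 30}
daxpy(#x, 2.0, x, y)
print(y[1], y[2], y[3])
```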
> I've given a look at the trace and it seems that the root of the
> problem is the cblas function that LuaJIT2 doesn't like:
> [TRACE --- rkf45vec-out.lua:78 -- NYI: unsupported C function type at
> the function incriminated is cblas_daxpy. But I don't really know.
My fault. Just released a fix for this to git HEAD. Much faster
BTW: Consider checking all of your code for bad uses of global variables. E.g.
  cblas = ffi.load('libgslcblas-0')
should be
  local cblas = ffi.load('libgslcblas-0')
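One way to catch a missing `local` like this is to make assignments to undeclared globals raise an error, similar in spirit to the strict.lua script shipped with the Lua 5.1 distribution. The sketch below is a hypothetical helper (`forbid_new_globals` is not part of Lua or LuaJIT); it only catches assignments to globals that do not exist yet, and it leaves reads untouched. Declaring the variable `local` also means each access is a local or upvalue reference instead of a lookup in the globals table.

```lua
-- Hypothetical helper: raise an error whenever code assigns to a global
-- that does not already exist in env (reads and existing globals are unaffected).
local function forbid_new_globals(env)
  setmetatable(env, {
    __newindex = function(_, name)
      -- Level 2 reports the error at the line doing the assignment.
      error("assignment to undeclared global '" .. tostring(name) .. "'", 2)
    end,
  })
end

-- Usage example: a forgotten 'local' is now reported instead of
-- silently creating a global variable.
forbid_new_globals(_G)
local ok, err = pcall(function() cblas = "oops" end)
print(ok, err)
```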