- Subject: Re: Avoiding FFI- allocations + using SSE-vectors
- From: Mike Pall <mikelu-1202@...>
- Date: Mon, 6 Feb 2012 13:10:47 +0100
Wolfgang Pupp wrote:
> I also tried to make it use SSE, and that seems to work just fine
> (MinGW on Win7 32). It needs a tiny wrapper-dll, because LuaJIT can't
> directly call ffi-functions with vector arguments (yet!)- so I pass
> them via pointers.
> I only implemented single-precision-4-float addition for now anyway-
> it's just a proof of concept, I think ffi- vector operations are
> somewhere on Mike's TODO-list (maybe someone will even sponsor that
> and we'll have it in a blink ;).
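For readers following along, a wrapper of the kind described above might look roughly like the sketch below. This is not Wolfgang's code: the names vec4f and add4f are made up for illustration, and it uses the GCC/Clang vector_size extension rather than explicit SSE intrinsics. On x86 with SSE enabled (e.g. -msse on 32-bit MinGW) the addition would normally compile to a single addps, but that depends on compiler and flags.

```c
/* Hypothetical wrapper: the names vec4f and add4f are illustrative only. */
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: four packed single-precision floats. */
typedef float vec4f __attribute__((vector_size(16)));

/* Adds the four floats at a and b and writes the four sums to out.
   Only plain float pointers cross the FFI boundary, so no vector
   arguments are passed by value. */
void add4f(const float *a, const float *b, float *out)
{
    vec4f va, vb, vr;

    assert(a != NULL && b != NULL && out != NULL);

    memcpy(&va, a, sizeof va);    /* load, safe for unaligned input */
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                 /* element-wise addition */
    memcpy(out, &vr, sizeof vr);  /* store, safe for unaligned output */
}
```

On the Lua side this would be declared with ffi.cdef as a function taking two const float * arguments and a float * result, and called with float[4] arrays created via ffi.new.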
My plan for implementing SIMD operations is this:
- Add generic vector type(s) to the IR of the JIT compiler.
- Record the minimum required vector ops needed for the basic
  initialization and assignment semantics. We can get by with
  init/splat, select/project and load/store.
- Decompose vector ops, if the machine-specific backend doesn't
  support a particular type.
- Add support for the basic hardware vector ops to the backends.
  First would be SSE for x86/x64.
- Add support for user-definable intrinsics (builtins) with machine
  code templates, e.g.:
    __v2df __builtin_ia32_addpd(__v2df, __v2df) __mcode("660F58rM");
  Needs to be done for each backend, x86/x64 first. The intrinsics
  will be inlined into the JIT-compiled machine code, of course.
- Add a ffi.vec module that defines standard vector types and
  attaches the machine-specific intrinsics as methods/metamethods:
    local vec = require("ffi.vec")
    local v2df = vec.v2df
    local v1 = v2df(1.5, 2.5)
    local v2 = v2df(10.0, -4.25)
    print(v1 + v2) --> cdata<__v2df>: (11.5, -1.75)
- Miscellaneous stuff, e.g. FFI calling conventions for passing
  vectors to C functions or alignment restrictions for vectors.
- Add allocation sinking and store sinking to avoid (most) vector
  allocations. SIMD vectors are value types, so this is a bit
  easier: vectors are immutable, passed by value and you cannot
  get a reference to the boxed contents. I.e. the boxing can be
  eliminated in almost all cases.
  But one really needs generic support for allocation/store
  sinking. That would allow efficient support for complex numbers
  and for short arrays of vectors (SIMD matrix types), too.
- Add auto-vectorization. Umm, err ... that's really tough. Let's
  leave that out of the plan for now. :-)
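To make the basic op set a bit more concrete, here is a rough C-level analogue of what the plan describes. It is only a sketch: it uses the GCC/Clang vector_size extension, not anything in LuaJIT, and the function names (v2df_init, v2df_splat, v2df_project, and so on) are invented for this example. "project" is read here as extracting a single lane, which is an assumption about the intended meaning. The last function shows the decomposition idea: the same addition expressed lane by lane with scalar ops, as a backend without a native vector type would have to do it.

```c
/* Illustrative only: these helpers are hypothetical and are not LuaJIT API. */
#include <assert.h>
#include <string.h>

/* Two packed doubles, analogous to GCC's __v2df. */
typedef double v2df __attribute__((vector_size(16)));

/* init: build a vector from individual scalars. */
v2df v2df_init(double x, double y)
{
    v2df v = { x, y };
    return v;
}

/* splat: replicate one scalar into every lane. */
v2df v2df_splat(double x)
{
    v2df v = { x, x };
    return v;
}

/* project: extract a single lane as a scalar. */
double v2df_project(v2df v, int lane)
{
    assert(lane == 0 || lane == 1);
    return v[lane];
}

/* load/store: move a vector from/to memory without alignment assumptions. */
v2df v2df_load(const double *p)
{
    v2df v;
    memcpy(&v, p, sizeof v);
    return v;
}

void v2df_store(double *p, v2df v)
{
    memcpy(p, &v, sizeof v);
}

/* Vector add: the compiler may emit one addpd on x86 with SSE2. */
v2df v2df_add(v2df a, v2df b)
{
    return a + b;
}

/* Decomposed add: the same result built from per-lane scalar ops. */
v2df v2df_add_scalar(v2df a, v2df b)
{
    return v2df_init(v2df_project(a, 0) + v2df_project(b, 0),
                     v2df_project(a, 1) + v2df_project(b, 1));
}
```

With the values from the ffi.vec example, v2df_add(v2df_init(1.5, 2.5), v2df_init(10.0, -4.25)) yields the lanes 11.5 and -1.75, and v2df_add_scalar gives the same result.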
Phew ... ok, so that's a lot of work (a couple months). If any
potential sponsor is interested, I'm willing to go for this.
Please contact me via the address given on the sponsorship page:
Another thing to consider is that I'd rather postpone all of this
to LuaJIT 2.1. Part of it is a complete redesign of the GC and the
memory allocator. That makes some things a lot easier (e.g. aligned
I really need to freeze LuaJIT 2.0 and finally make a non-beta
release once the MIPS port and the console ports are done. Maybe
ARM VFP support (hardware floating-point ops) ought to be part of
LuaJIT 2.0, too (sponsor needed). Ok, so there are plenty of items
left on the TODO list, but one has to draw a line somewhere.