• Subject: Re: Avoiding FFI- allocations + using SSE-vectors
• From: Adam Strzelecki <ono@...>
• Date: Mon, 6 Feb 2012 12:53:53 +0100

```> I just tried another approach:
> A circular buffer for intermediate results- there are no more
> allocations for calling arithmetic functions, *BUT* if you don't copy
> results (and just keep references instead), they *will* be overwritten
> (sooner or later). (…) I also tried to make it use SSE, and that seems to work just fine
> (MinGW on Win7 32). It needs a tiny wrapper-dll, because LuaJIT can't
> directly call ffi-functions with vector arguments (yet!)- so I pass
> them via pointers.

That's a cute idea. Testing:
local start = os.clock()
local a, b = M.new(0,0,0,0), M.new(1,2,3,4)
for i = 1, 20000000 do a = a+b end
print(string.format('in %f seconds', os.clock()-start), a)

I get these 20M operations in 0.299074 seconds. However, if I comment out _mm_add_ps in the lua_mm_add_ps C function, it drops only to 0.279646 seconds, so the actual time spent in vector addition is only about 6% of the total call time, which is pretty disappointing.
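
As a quick check, the share of time spent in the addition itself can be recomputed from the two timings above (the variable names are only for illustration):

local with_add, without_add = 0.299074, 0.279646
-- fraction of the total call time that disappears when _mm_add_ps is removed
print(string.format('%.1f%%', (with_add - without_add) / with_add * 100))  -- prints 6.5%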

A 4x4 matrix addition via a pure Lua function taking 2x16 arguments and returning 16 values (see my former mail) takes 0.104428 seconds for 20M operations (16x20M additions). So it seems it can't do any better when working on complex (boxed) types (ctypes).
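
A minimal sketch of what such a scalar-only function could look like (an illustration only, not the code from the former mail; the name mat4_add and its parameter names are hypothetical):

local function mat4_add(a1, a2, a3, a4, a5, a6, a7, a8,
                        a9, a10, a11, a12, a13, a14, a15, a16,
                        b1, b2, b3, b4, b5, b6, b7, b8,
                        b9, b10, b11, b12, b13, b14, b15, b16)
  -- Operands and results are plain Lua numbers; no cdata objects are involved.
  return a1 + b1, a2 + b2, a3 + b3, a4 + b4,
         a5 + b5, a6 + b6, a7 + b7, a8 + b8,
         a9 + b9, a10 + b10, a11 + b11, a12 + b12,
         a13 + b13, a14 + b14, a15 + b15, a16 + b16
end

print(mat4_add(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2))  -- prints 3 sixteen times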

Since the results when working on scalars directly are the best so far, I am thinking about adding some syntactic sugar for Lua that would wrap vectors into many scalar variables kept directly on the stack, i.e.:

local a:4 = 0, 0, 0, 0
local b:4 = 1, 2, 3, 4
for i = 1, 20000000 do
a = a + b
end

Translating to:
local a_1, a_2, a_3, a_4 = 0, 0, 0, 0
local b_1, b_2, b_3, b_4 = 1, 2, 3, 4
for i = 1, 20000000 do
a_1, a_2, a_3, a_4 = a_1 + b_1, a_2 + b_2, a_3 + b_3, a_4 + b_4
end

But that requires changing the Lua parser, which is controversial.

Cheers,
--