I took a different approach: instead of relying on LuaJIT to return fresh scalars or vectors, I pass the destination in myself, which means I had to make sure these are preallocated in some way.

The functions can also return "new" values, but because these are expensive, the ones that do so are suffixed with "new" (to make the cost more obvious, and to require more typing). For example, mulnew(a, b) returns a new vector3, while mul(result, a, b) writes the product into result.
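To make the naming convention concrete, here is a minimal sketch of how such a pair could look. It assumes LuaJIT with the ffi module; the vec3_t type, the component-wise product, and the bodies of mul and mulnew are my illustration here, not the actual code from the framework.

```lua
-- Sketch only: requires LuaJIT (the ffi module). vec3_t and the function
-- bodies are illustrative assumptions, not the original framework code.
local ffi = require("ffi")

ffi.cdef[[ typedef struct { float x, y, z; } vec3_t; ]]
local vec3 = ffi.typeof("vec3_t")

-- Writes the component-wise product of a and b into result.
-- No cdata object is created inside this function.
local function mul(result, a, b)
  result.x = a.x * b.x
  result.y = a.y * b.y
  result.z = a.z * b.z
  return result
end

-- Convenience variant: creates a fresh vec3 on every call.
local function mulnew(a, b)
  return mul(vec3(), a, b)
end

-- Preallocate once, outside the hot loop, and reuse the same object.
local a, b = vec3(1, 2, 3), vec3(4, 5, 6)
local tmp = vec3()
for i = 1, 1000 do
  mul(tmp, a, b)
end
print(tmp.x, tmp.y, tmp.z)
```

The caller owns tmp and decides how long it lives, so the loop itself never asks for a new object; mulnew stays available for code where convenience matters more than speed.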

I was satisfied with the results, because there was no memory allocation, and even the assembly looked good (but then Mike came and said - NOT GOOD ENOUGH :) :) :) :) - but to me... juuuust fine!)

Here is some more on the topic:

On 2/3/2012 3:34 PM, Adam Strzelecki wrote:

I have a problem with LuaJIT FFI allocations in my OpenGL LuaJIT FFI framework. It is similar to the one described in the "LuaJIT - Is ffi.alloca possible?" thread from last year.

I got "mat4" (GLSL mat4 equivalent) type implemented as FFI metatype "struct { GLfloat m11 … m44; }". Everything works fine, however when I want to draw many objects, each having different model matrix, I need to pre-calculate:

   shader.modelView = view * model
   shader.modelViewProjection = projection * modelView

Both of these call the mat4MT.__mul function, which internally calls mat4() to create the result. Unfortunately, allocation takes most of the time here; all other calculations are negligible in comparison.

After these are pre-calculated, shaderMT.__newindex loads them into OpenGL using UniformMatrix4fv, which requires me to call ffi.cast(GLfloatp, matrix), as otherwise FFI complains about an incompatible argument. So again it seems to do another allocation & copy there. After I send these to OpenGL I do not store the values anywhere, so they are discarded in my program.

Is there any gentle way to avoid these allocations? If I disable this pre-calculation I get around ~1000FPS instead of 40 in my program.

Would the allocation sinking that is planned for LuaJIT help in this case? In C++ I would use some classes allocated on the stack, so there would be no need for a heap allocator.

Just to demonstrate that FFI allocation is two orders of magnitude slower than simple operations on locals:

local ffi = require('ffi')
local mat4 = ffi.typeof('float[16]')

-- Allocate and initialize a fresh 16-float array on every iteration.
local test
local start  = os.clock()
for i = 1, 20000000 do
   test = mat4(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
end
print(string.format('allocation took %f seconds', os.clock()-start))

-- Same number of values, but kept in plain locals with no allocation.
local t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15, t16 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
local start  = os.clock()
for i = 1, 20000000 do
   t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15, t16 = t1 + 1, t2 + 2, t3 + 3, t4 + 4, t5 + 5, t6 + 6, t7 + 7, t8 + 8, t9 + 9, t10 + 10, t11 + 11, t12 + 12, t13 + 13, t14 + 14, t15 + 15, t16 + 16
end
print(string.format('assignment took %f seconds', os.clock()-start))

