[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: How to correctly manage a multi-dimension array with LuaJIT FFI ?
- From: Mike Pall <mikelu-1102@...>
- Date: Wed, 9 Feb 2011 18:27:54 +0100
Johnson Lin wrote:
> First of all, since the pointers are not followed by GC, so obviously
> you cannot write:
>
> a = ffi.new("int*[?]", height)
> for i = 0, height - 1 do
> a[i] = ffi.new("int[?]", width)
> end
This would work if you kept references to the cdata objects for
all rows in parallel in a Lua table. Or only keep them in a Lua
table -- starting indexing at 0 is efficient with LuaJIT.
Arguably that's a bit tedious and not the optimal solution,
anyway.
> So I came up with this:
>
> a = ffi.new("int["..height.."]["..width.."]")
> -- then use this variable-size a
Yes, this is the only solution for n-dimensional VLAs, because all
but the outermost array needs to have a constant size. The
outermost array can be a VLA, so "int[?][10]" would be ok, too.
> It worked too, however the perfomance seemed sub-optimal.
Umm, you didn't show the code for this? But my guess is you tried
to cache the row outside of a loop, e.g.
for y=0,14 do
local row = a[i] -- DON'T DO THIS!
for x=0,14 do local b = row[x] ... end
end
This creates intermediate reference cdata objects, which cannot be
eliminated by the JIT compiler. The ensuing GC overhead will ruin
the performance. Always do it like this with FFI data structures:
for y=0,14 do
for x=0,14 do local b = a[y][x] ... end -- OK!
end
Or maybe there was some other way the intermediate references
escaped. Otherwise the 2D solution ought to be the most efficient
one.
> So I modifed that code to simulate a 2d array with 1d array:
>
> a = ffi.new("int["..height*width.."]")
> -- then use this variable-size a, accessing it's element via a[ i*h + j ]
You can use a VLA here: a = ffi.new("int[?]", height*width)
> But still.. is this the best way to implement a 2d array with FFI for
> now? How about in the future? Or how about 3-dimensions and more? That
> would be a pain to simulate through a 1d array. Any help or hint
> appreciated.
The n-dimensional solution is currently more efficient than the
1-dimensional solution, provided the intermediate references don't
escape. It certainly won't performe worse than the 1-dimensional
solution in the future.
> And I am not very sure if my Lua version's implementation is
> near optimal or not.
Well, there's a lot to optimize. But most of that applies to the
C++ version, too:
- Either use 'uint8_t' for the elements to avoid sign-extensions
or use 'int' for better code generation. You'll have to
experiment a bit and use whatever is faster for your case.
- Array dimensions of size 2^k are more efficient, even if this
leaves some elements unused. You don't need to do this for the
outermost array, of course.
- Don't use conditionals to check for the borders. Leave a
1-element sized border around the area and iterate only over the
inner part. Either keep the border zero or copy it from the
other side before starting an iteration to get wrap-around
behavior.
- Don't use a loop to count the neighbors. Unroll this for the
eight elements. This is both shorter and faster:
local count = (old[y-1][x-1] + old[y-1][x] + old[y-1][x+1]) +
(old[y][x-1] + old[y][x+1]) +
(old[y+1][x-1] + old[y+1][x] + old[y+1][x+1])
The parentheses are deliberate to help super-scalar CPUs.
- Don't use a conditional to select the ruleset -- add an offset
to the lookup index.
- Or better yet, don't use tables for the rulesets -- use bit
tricks (untested):
new[y][x] = bit.band(bit.rshift(bit.lshift(old[y][x],2)+8, count), 1)
- For an int[16][16] array this generates the following machine
code for the inner loop on x64 (probably close to optimal):
->LOOP:
mov ecx, [rdx+rbp*4-0x3c]
mov r14d, [rdx+rbp*4+0x4]
mov r15d, [rdx+rbp*4+0x44]
mov ebx, [rdx+rbp*4+0x8]
shl ebx, 0x02
add ebx, +0x08
add r15d, [rdx+rbp*4+0x48]
add r15d, [rdx+rbp*4+0x4c]
add r14d, [rdx+rbp*4+0xc]
add ecx, [rdx+rbp*4-0x38]
add ecx, [rdx+rbp*4-0x34]
add ecx, r14d
add ecx, r15d
shr ebx, cl
and ebx, +0x01
mov [rax+rbp*4+0x8], ebx
add ebp, +0x01
cmp ebp, +0x0e
jle ->LOOP
- I'm sure there are many more tricks to speed up the Game of
Life. There's plenty of code for this out there, take a look.
--Mike