lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Johnson Lin wrote:
> First of all, since the pointers are not followed by GC, so obviously
> you cannot write:
> 
> a = ffi.new("int*[?]", height)
> for i = 0, height - 1 do
>   a[i] = ffi.new("int[?]", width)
> end

This would work if you kept references to the cdata objects for
all rows in parallel in a Lua table. Or only keep them in a Lua
table -- starting indexing at 0 is efficient with LuaJIT.

Arguably that's a bit tedious and not the optimal solution,
anyway.

> So I came up with this:
> 
> a = ffi.new("int["..height.."]["..width.."]")
> -- then use this variable-size a

Yes, this is the only solution for n-dimensional VLAs, because all
but the outermost array needs to have a constant size. The
outermost array can be a VLA, so "int[?][10]" would be ok, too.

> It worked too, however the perfomance seemed sub-optimal.

Umm, you didn't show the code for this? But my guess is you tried
to cache the row outside of a loop, e.g.

  for y=0,14 do
    local row = a[i] -- DON'T DO THIS!
    for x=0,14 do local b = row[x] ... end
  end

This creates intermediate reference cdata objects, which cannot be
eliminated by the JIT compiler. The ensuing GC overhead will ruin
the performance. Always do it like this with FFI data structures:

  for y=0,14 do
    for x=0,14 do local b = a[y][x] ... end -- OK!
  end

Or maybe there was some other way the intermediate references
escaped. Otherwise the 2D solution ought to be the most efficient
one.

> So I modifed that code to simulate a 2d array with 1d array:
> 
> a = ffi.new("int["..height*width.."]")
> -- then use this variable-size a, accessing it's element via a[ i*h + j ]

You can use a VLA here: a = ffi.new("int[?]", height*width)

> But still.. is this the best way to implement a 2d array with FFI for
> now? How about in the future? Or how about 3-dimensions and more? That
> would be a pain to simulate through a 1d array. Any help or hint
> appreciated.

The n-dimensional solution is currently more efficient than the
1-dimensional solution, provided the intermediate references don't
escape. It certainly won't performe worse than the 1-dimensional
solution in the future.

> And I am not very sure if my Lua version's implementation is
> near optimal or not.

Well, there's a lot to optimize. But most of that applies to the
C++ version, too:

- Either use 'uint8_t' for the elements to avoid sign-extensions
  or use 'int' for better code generation. You'll have to
  experiment a bit and use whatever is faster for your case.

- Array dimensions of size 2^k are more efficient, even if this
  leaves some elements unused. You don't need to do this for the
  outermost array, of course.

- Don't use conditionals to check for the borders. Leave a
  1-element sized border around the area and iterate only over the
  inner part. Either keep the border zero or copy it from the
  other side before starting an iteration to get wrap-around
  behavior.

- Don't use a loop to count the neighbors. Unroll this for the
  eight elements. This is both shorter and faster:

  local count = (old[y-1][x-1] + old[y-1][x] + old[y-1][x+1]) +
                (old[y][x-1]   +               old[y][x+1])   +
                (old[y+1][x-1] + old[y+1][x] + old[y+1][x+1])

  The parentheses are deliberate to help super-scalar CPUs.

- Don't use a conditional to select the ruleset -- add an offset
  to the lookup index.

- Or better yet, don't use tables for the rulesets -- use bit
  tricks (untested):

  new[y][x] = bit.band(bit.rshift(bit.lshift(old[y][x],2)+8, count), 1)

- For an int[16][16] array this generates the following machine
  code for the inner loop on x64 (probably close to optimal):

  ->LOOP:
  mov ecx, [rdx+rbp*4-0x3c]
  mov r14d, [rdx+rbp*4+0x4]
  mov r15d, [rdx+rbp*4+0x44]
  mov ebx, [rdx+rbp*4+0x8]
  shl ebx, 0x02
  add ebx, +0x08
  add r15d, [rdx+rbp*4+0x48]
  add r15d, [rdx+rbp*4+0x4c]
  add r14d, [rdx+rbp*4+0xc]
  add ecx, [rdx+rbp*4-0x38]
  add ecx, [rdx+rbp*4-0x34]
  add ecx, r14d
  add ecx, r15d
  shr ebx, cl
  and ebx, +0x01
  mov [rax+rbp*4+0x8], ebx
  add ebp, +0x01
  cmp ebp, +0x0e
  jle ->LOOP

- I'm sure there are many more tricks to speed up the Game of
  Life. There's plenty of code for this out there, take a look.

--Mike