On 15 February 2011 01:33, Mike Pall wrote:
> Each function ends with a tail call to the next target(s). This
> calls an existing function or triggers a new compilation. LuaJIT
> recognizes tail recursion and turns it into loops, so this ought
> to perform well.

Thank you for the suggestion. For some reason I didn't consider using
tail calls for control flow, but this makes it very straightforward.
An initial prototype implementation with a number of obvious further
opportunities for speed improvement can already beat the C instruction
set simulator on some benchmarks.
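
A minimal sketch of the tail-call structure described above, not the actual simulator code. The `program` table, the `target` helper and the `cpu` fields are all hypothetical names made up for illustration. Each generated function ends with a tail call to the function for the next address, and `target` compiles that function on first use:

```lua
local load_chunk = loadstring or load  -- Lua 5.1/LuaJIT vs. Lua 5.2+

-- Hypothetical "simulated program": pc -> Lua source for the opcode body,
-- plus the address of the next instruction (nil means stop).
local program = {
  [0] = { body = "cpu.acc = cpu.acc + 1", next_pc = 4 },
  [4] = { body = "cpu.acc = cpu.acc * 2", next_pc = 8 },
  [8] = { body = "cpu.halted = true" },
}

local cache = {}  -- pc -> compiled function

-- Return the function for `pc`, generating and compiling it on first use.
local function target(pc)
  local fn = cache[pc]
  if fn then return fn end
  local insn = program[pc]
  -- The generated body ends with a tail call to the next target.
  local tail = insn.next_pc
    and ("return target(" .. insn.next_pc .. ")(cpu)")
    or "return cpu"
  local src = "local target = ...\n" ..
              "return function(cpu)\n" .. insn.body .. "\n" .. tail .. "\nend"
  fn = assert(load_chunk(src))(target)
  cache[pc] = fn
  return fn
end

local cpu = target(0)({ acc = 1, halted = false })
```

Because every transfer of control is a proper tail call, the Lua stack does not grow however long the simulated program runs.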

I've found that one part of the structure of my program completely
throws off the compiler. For memory access, I have functions like
the following:

local function check_mem_range(addr)
  if (addr >= cpu.memsize) then
    warn("word read/write from 0x%x outside memory range\n", addr)
    cpu.exception = C.SIGSEGV
    return false
  end
  return true
end

local function check_mem_alignment(addr, mask)
  if (bit.band(addr, mask) ~= 0) then
    warn("word read/write from unaligned address: 0x%x\n", addr)
    cpu.exception = C.SIGBUS
    return false
  end
  return true
end

function rlat(addr)
  if not check_mem_range(addr) then return 0 end
  if not check_mem_alignment(addr, 3) then return 0 end

  return ffi.cast(int32p_t, cpu.memory+addr)[0]
end
-- and similar for wlat, rhat, rbat etc

Obviously, now that I'm doing run-time generation of Lua code it would make
a lot more sense to inline all this into the generated opcode body.
When writing this code initially, I had hoped that LuaJIT would just
inline check_mem_alignment and check_mem_range. But with the standard
heuristics the presence of these calls means it fails to generate any
acceptable traces, as far as I can see due to 'too many snapshots'.
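
As a rough illustration of what inlining the checks into a generated opcode body could look like (a sketch under assumptions, not the actual codegen): `emit_checked_load`, `read32` and the string exception values are hypothetical stand-ins, the `warn` calls are left out for brevity, and the alignment test uses `%` in place of a bitwise AND so that the sketch also runs on stock Lua.

```lua
local load_chunk = loadstring or load  -- Lua 5.1/LuaJIT vs. Lua 5.2+

-- Hypothetical helper: returns Lua source for a 32-bit load with the range
-- and alignment checks written inline instead of as function calls.
-- `dst` and `addr_expr` are source fragments spliced into the output.
local function emit_checked_load(dst, addr_expr)
  return table.concat({
    "do",
    "  local addr = " .. addr_expr,
    "  if addr >= cpu.memsize then",
    "    cpu.exception = SIGSEGV; return cpu",
    "  elseif addr % 4 ~= 0 then",
    "    cpu.exception = SIGBUS; return cpu",
    "  end",
    "  " .. dst .. " = read32(cpu, addr)",
    "end",
  }, "\n")
end

-- Build one opcode body: r[1] = word at (r[2] + 4).
local src = "local SIGSEGV, SIGBUS, read32 = ...\n" ..
            "return function(cpu)\n" ..
            emit_checked_load("cpu.r[1]", "cpu.r[2] + 4") .. "\n" ..
            "return cpu\nend"

-- Placeholder exception values and a table-backed memory reader.
local op = assert(load_chunk(src))("SIGSEGV", "SIGBUS",
  function(cpu, addr) return cpu.mem[addr] end)

local cpu = op({ memsize = 16, r = { 0, 4 }, mem = { [8] = 42 } })
```

In the real simulator the table-backed `read32` would presumably be replaced by the `ffi.cast` load from `rlat`.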

To give an idea of what a massive difference this makes, consider the
following numbers from my test program (the simulated code is a naive
Fibonacci implementation). Using my prototype codegen and including
the check_mem calls it takes ~2 minutes to run. Just adding
-Omaxsnap=200 (up from the default 100) results in a drastically
reduced 1.1-1.4 second runtime. Commenting out the calls results in ~1
second runtime with default parameters.
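
For reference, the same parameter can also be set from inside a script via LuaJIT's `jit.opt` module rather than on the command line. A small sketch, guarded so that it does nothing on a non-LuaJIT interpreter; the value 200 is simply the one used in the measurement above:

```lua
-- Equivalent to running: luajit -Omaxsnap=200 script.lua
if jit then
  -- jit.opt.start takes the same option strings as the -O command-line flag.
  require("jit.opt").start("maxsnap=200")
end

-- Running with "luajit -jv script.lua" prints trace starts and aborts,
-- which is one way to see whether traces are still being abandoned.
```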

I'm very pleased with the performance numbers I've had so far (though
I recognise fib is something of a best case), and I've hardly begun to
fix the cases where unnecessary work is being done. I'm only posting
at this point as I thought it was interesting just how much difference
the maxsnap parameter can make in this case.