- Subject: Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2
- From: Alex Bradbury <asb@...>
- Date: Tue, 15 Feb 2011 16:11:29 +0000
On 15 February 2011 01:33, Mike Pall <email@example.com> wrote:
> Each function ends with a tail call to the next target(s). This
> calls an existing function or triggers a new compilation. LuaJIT
> recognizes tail recursion and turns it into loops, so this ought
> to perform well.
Thank you for the suggestion. For some reason I didn't consider using
tail calls for control flow, but this makes it very straightforward.
An initial prototype implementation, which still has a number of
obvious opportunities for further speed improvement, can already beat
the C instruction set simulator on some benchmarks.
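As a rough illustration of the tail-call structure described in the quoted suggestion, here is a minimal sketch. It is not the prototype discussed in this message: the toy `program` table, the `cpu` fields `r0`/`r1`, and the helpers `compile`, `dispatch` and `run` are all hypothetical names made up for the example. Each guest block is compiled to a Lua function on first use, and every generated function ends with a tail call to the next block.

```lua
-- Hypothetical sketch: one Lua function per guest basic block,
-- each ending in a tail call to the next block.
local loadstring = loadstring or load   -- Lua 5.1/LuaJIT vs. Lua 5.2+

local cpu = { r0 = 0, r1 = 0 }          -- toy register file (assumed layout)
local blocks = {}                       -- cache: guest pc -> compiled function

-- Toy guest program: add r0 to r1, decrement r0, loop while r0 > 0.
local program = {
  [0] = { body    = "cpu.r1 = cpu.r1 + cpu.r0; cpu.r0 = cpu.r0 - 1",
          next_pc = "cpu.r0 > 0 and 0 or 4" },
  [4] = { halt = true },
}

local dispatch  -- forward declaration; compile() and dispatch() refer to each other

-- Generate and compile the Lua source for the block starting at pc.
local function compile(pc)
  local blk = assert(program[pc], "no block at this pc")
  if blk.halt then
    return function() end               -- end of simulation: just return
  end
  local src = string.format([[
    local cpu, dispatch = ...
    return function()
      %s
      return dispatch(%s)()  -- tail call into the next block
    end
  ]], blk.body, blk.next_pc)
  return assert(loadstring(src))(cpu, dispatch)
end

-- Return the compiled function for pc, compiling it on first use.
function dispatch(pc)
  local fn = blocks[pc]
  if not fn then
    fn = compile(pc)
    blocks[pc] = fn
  end
  return fn
end

-- Run the toy program with r0 = n and return the accumulated r1.
local function run(start_pc, n)
  cpu.r0, cpu.r1 = n, 0
  dispatch(start_pc)()
  return cpu.r1
end
```

Because `return dispatch(...)()` is a proper tail call, the Lua stack does not grow however many guest blocks are executed, so a long-running guest loop is safe.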
I've found that one part of the structure of my program completely
throws off the compiler. For memory access, I have functions like the
following:

  local function check_mem_range(addr)
    if (addr >= cpu.memsize) then
      warn("word read/write from 0x%x outside memory range\n", addr)
      cpu.exception = C.SIGSEGV
      return false
    end
    return true
  end

  local function check_mem_alignment(addr, mask)
    if (bit.band(addr, mask) ~= 0) then
      warn("word read/write from unaligned address: 0x%x\n", addr)
      cpu.exception = C.SIGBUS
      return false
    end
    return true
  end

  local function rlat(addr)
    if not check_mem_range(addr) then return 0 end
    if not check_mem_alignment(addr, 3) then return 0 end
    return ffi.cast(int32p_t, cpu.memory+addr)
  end
  -- and similar for wlat, rhat, rbat etc
Obviously, now that I'm generating Lua code at run time, it would make
a lot more sense to inline all of this into the generated opcode body.
When writing this code initially, I had hoped that LuaJIT would just
inline check_mem_alignment and check_mem_range. But with the standard
heuristics, the presence of these calls means it fails to generate any
acceptable traces; as far as I can see, this is due to 'too many
snapshots'.
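A minimal sketch of what inlining the checks into a generated opcode body might look like follows. The generator `gen_word_load`, the helper `compile_op`, the toy table-based memory, and the string exception codes are all assumptions made for the example; the real simulator uses FFI memory, `bit.band` and `C.SIGSEGV`/`C.SIGBUS`, and also calls `warn`, which is omitted here. The alignment test uses `addr % 4` so that the sketch also runs outside LuaJIT.

```lua
-- Hypothetical sketch: emit the range and alignment checks directly
-- into the generated opcode source instead of calling helper functions.
local loadstring = loadstring or load   -- Lua 5.1/LuaJIT vs. Lua 5.2+

-- Toy simulator state (assumed): 16 bytes of memory, one value per word.
local cpu = { memsize = 16, memory = {}, r0 = 0, r1 = 0, exception = nil }
for addr = 0, cpu.memsize - 1, 4 do
  cpu.memory[addr] = addr * 10
end

-- Emit Lua source that loads the word at addr_expr into register dst,
-- with both checks inlined.
local function gen_word_load(dst, addr_expr)
  return string.format([[
    do
      local addr = %s
      if addr >= cpu.memsize then
        cpu.exception = "SIGSEGV"; return
      elseif addr %% 4 ~= 0 then
        cpu.exception = "SIGBUS"; return
      end
      cpu.%s = cpu.memory[addr]
    end
  ]], addr_expr, dst)
end

-- Wrap a generated body in a function and compile it.
local function compile_op(body_src)
  local src = "local cpu = ...\nreturn function()\n" .. body_src .. "\nend"
  return assert(loadstring(src))(cpu)
end

-- Example opcode: r1 = word at address r0
local load_r1 = compile_op(gen_word_load("r1", "cpu.r0"))
```

Calling `load_r1()` with `cpu.r0 = 8` stores the word at address 8 in `cpu.r1`; an unaligned or out-of-range address sets `cpu.exception` and leaves `cpu.r1` untouched.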
To give an idea of what a massive difference this makes, consider the
following numbers from my test program (the simulated code is a naive
Fibonacci implementation). Using my prototype codegen with the
check_mem calls included, it takes ~2 minutes to run. Just adding
-Omaxsnap=200 (up from the default of 100) drastically reduces the
runtime to 1.1-1.4 seconds. Commenting out the calls gives a ~1
second runtime with default parameters.
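The same limit can also be raised from inside the script rather than on the command line. The sketch below uses LuaJIT's `jit.opt.start`, which is the programmatic backend for the `-O` option; the wrapper name `raise_maxsnap` is hypothetical, and the guard lets the snippet run under plain Lua, where it does nothing.

```lua
-- Raise the per-trace snapshot limit when running under LuaJIT.
-- Returns true if the option was applied, false otherwise (e.g. plain Lua).
local function raise_maxsnap(limit)
  if type(jit) ~= "table" or type(jit.opt) ~= "table" then
    return false                        -- not LuaJIT: nothing to tune
  end
  jit.opt.start("maxsnap=" .. limit)    -- same effect as -Omaxsnap=<limit>
  return true
end

local applied = raise_maxsnap(200)
```

To check whether traces are still being aborted, running the script with `luajit -jv` prints trace events, including the reason for each abort.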
I'm very pleased with the performance numbers I've had so far (though
I recognise fib is something of a best case), and I've hardly begun to
fix the cases where unnecessary work is being done. I'm only posting
at this point because I thought it was interesting just how much
difference the maxsnap parameter can make in this case.