[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: LuaJIT roadmap 2008 question
- From: Mike Pall <mikelu-0910@...>
- Date: Fri, 30 Oct 2009 03:28:24 +0100
Alex Davies wrote:
> I believe it means that the call info, ie saved instruction pointer,
> function base/top (to allow call stack traversal), number of tail calls,
> etc are all stored as implicit, hidden local variables.
Sure, that's the basic idea, but it goes beyond it. Actually, most
of the stuff kept in CallInfo in standard Lua is not needed. Even
LuaJIT 1.x has dropped tail call counting long ago (with zero
complaints) -- IMHO that's the job of the debugger, not the VM.
Vararg functions get an extra frame below the actual frame, which
is used to hold the variable arguments. This way the current
function is always at BASE[-1] and needs no extra pointer. TOP can
either be reconstructed on-the-fly or can be stored in L->top for
calls to C. BASE can be kept in a register or in L->base.
Ok, that still leaves us with a need for the return address (the
bytecode PC of caller), the number of expected results and a way
to restore the callers BASE. All of this can be combined into a
frame link, which is a single 32 bit value. It's stored instead of
the type tag in the upper 32 bits of BASE[-1]. The lower 32 bits
contain the function (closure).
There are two basic types of frame links, depending on the kind of
the _calling_ frame. If it's a Lua frame, then the frame link
contains the return address: a bytecode PC with the lowest 2 bits
set to zero. For all other kinds of frames, the frame link is a
delta to the previous frame: this is a multiple of 8 plus one of 6
remaining bit patterns in the lowest 3 bits. These are used to
differentiate upcalls from C, protected frames, vararg frames and
continuations (for metamethods).
Only backwards frame traversal is needed (e.g. for stack unwinding
in error handling). It starts at the topmost frame link held in
BASE[-1]. Operand A from bytecode PC[-1] (the CALL instruction) or
the delta can be used to get to the previous frame link.
The high orthogonality of the scheme really simplifies a lot of
code. A frame slot is just a regular stack slot -- the frame link
overlays the tag of the tagged value, but is always lower than any
non-number type. In fact, a regular type check considers this slot
as a number, which is convenient in many places (e.g. for the GC).
A picture says more than a thousand words:
<-------. .------------------. .------. .-- BASE
\ V \ V \ V
tag -> |delta| STR | Num | NIL | PC1 | TAB | PC2 | Num | Num | ==>
value-> |func1|GCstr| ber | |func2|GCtab|func3| ber | ber | ==>
^----- prev. BASEs -----^ current PC = PC3
> I cannot figure out why this speeds up call handling substantially.
> May it reduce cache misses?
What's so nice about this layout is that a call simply needs to
bump BASE, save the PC in BASE[-1] and load the new PC. A return
restores the PC from BASE[-1], adjusts the return value(s) and
BASE with the help of the CALL instruction in PC[-1] and then
continues executing code. This makes function calls and returns
blazingly fast. The standard library functions don't need to set
up and tear down the frame at all, except for the fallback path
(coercions or errors).
Even better, the JIT compiled code doesn't need to sync back the
frame info to the Lua stack in most cases. It wouldn't be too
costly either, since it's the same as syncing back any other stack
slot (one or two stores).
Summary: lower instruction count + less memory traffic + lower
cache usage -> faster code.
Alex Davies wrote:
> Perhaps. There's at least one other advantage - in a register poor
> architecture like x86, it allows a function to have full access to its
> callinfo without reserving a second register to point to the CallInfo stack
> (assuming a register is already reserved for the TValue stack, which makes
> sense). And even on architectures with more registers they then have to
> worry then about keeping them valid through growths/shrinks.
That's correct. The interpreter needs a single register to point
to the stack (BASE). It needs another 3 permanent registers for
the current bytecode PC, the opcode dispatch table (DISPATCH) and
the pointer to the constants of the current function (KBASE). That
leaves us 3 temporary registers on x86 which are needed to execute
the 3-operand bytecode (RA, RB and RC). Yup, there's no free
register left. =:-)
> It would also allow just one bounds test on entering a function, vs the
> current two. In Jit every instruction counts... something as minor as that
> will show up on recursion tests.
Yes, exactly. Now, the one remaining stack limit check is the most
costly step for calls (in the interpreter). Maybe I can somehow
get rid of it, too (without virtual memory tricks). :-)
> Definitely looking forward to playing around with it when it comes out ;)
Umm, Alex, please check your mail (or your spam folder). :-)