lua-users home
lua-l archive



Greg wrote:
> I've just been reading about the advances SquirrelFish is making.  It seems
> like they have learned a lot from Lua.  It is fascinating to see how this
> javascript implementation is becoming much faster.  It seems that they are
> employing many advanced optimizations.  It reminds me a lot of LuaJIT.

Umm, but SquirrelFish is an interpreter and LuaJIT is a JIT
compiler. :-)

> One of the things they mentioned is "direct-threading".  Does Lua support direct
> threading? Just curious.

Well ... terminology is a bit weak in this area. Instead, here's a
short summary of the relevant internals of the various VMs:

[Note that all of the following VMs are register-based, so they
usually need two or more operands for each instruction.]

The interpreter used in the Lua 5.1 VM has 32 bit instructions
with a 6 bit opcode field and 2 or 3 operands with 8/9/9 or 8/18
bits. It uses a centralized switch-based dispatch (i.e. a table
lookup) based on the opcode field.

SquirrelFish has 64 bit instructions with a 32 bit opcode field
and a 32 bit operand field. Additional operands are taken from
successive instructions. If the compiler supports it (e.g. GCC's
labels-as-values feature), the opcode field is a void * pointing
to the C code implementing this particular operation. It uses
decentralized dispatch, i.e. a goto *vPC->u.opcode after every
operation. It falls back to switch-based dispatch if the compiler
doesn't support computed gotos.

There have been various attempts to change Lua's interpreter loop
to use computed gotos (still using a table-based lookup). None have
shown substantial speedups. The most likely reason is that GCC's
register allocator does not deal well with the resulting N->N
control-flow and the gain from the reduced branch mispredictions is
too small since register-based VMs have 'heavy' instructions.

LuaJIT 1.x is very close to the original Lua VM and contains the
same interpreter. But it JIT-compiles all methods on their first
invocation, so the interpreter is effectively unused. Bytecode is
translated more or less 1:1 to machine code with light optimizations.

LuaJIT 2.x (not out yet) uses a different VM. The bytecode has an 8
bit opcode field and 2 or 3 operands with 8/8/8 or 8/16 bits. The
interpreter is written in hand-coded assembler. Dispatch is
decentralized at the end of every operation using an indirect jump
through a central table, indexed by the opcode. The dispatch table
can be patched on-the-fly to support debugging and trace recording.

The LJ2 interpreter by itself is roughly 2x-4x faster than Lua's
interpreter. But this is a result of many different optimizations.
It cannot be attributed solely to a single change, such as the
different dispatch method used.

LJ2 additionally employs a trace-based JIT compiler. Frequently
executed parts of the bytecode are recorded into an SSA-based
intermediate representation (IR) by following the control-flow
(trace recording). The resulting trace is highly optimized and
compiled to machine code.

--Mike