This sounds absolutely wonderful, thanks for the detailed info!

On 01/02/2008, Mike Pall <mikelu-0802@mike.de> wrote:
> There have been several requests to publish a roadmap for LuaJIT,
> so here it is:
>
> LuaJIT 1.x
> ----------
>
> - LuaJIT 1.1.4 will be released in the next few days. This is
>  strictly a minor bug fix upgrade to catch up with Lua 5.1.3.
>
> - CoCo 1.1.4 will be released at the same time, because Lua 5.1.3
>  makes some incompatible changes to the C call depth counting
>  (basically reverse-patching these parts back to Lua 5.1.2).
>
>
> LuaJIT 2.x
> ----------
>
> I've been working on LuaJIT 2.x for quite some time now. It took
> much longer than I had expected. I'm now at the third complete
> redesign, because the first two approaches I tried didn't work
> out: (1) shadow SSA info augmenting the old bytecode got too
> complex, and (2) one-pass on-the-fly SSA generation doesn't work
> for Lua because of an ugly corner case involving upvalue
> semantics.
>
> However I've gotten quite far with the third approach, namely a
> fast interpreter combined with a trace compiler (detailed
> description at the end of this posting).
>
> The VM has been designed from the ground up to support the new
> execution model. It still shares some components with the
> standard Lua 5.1 interpreter, but many parts have changed.
>
> Here are the most interesting changes:
>
> - The bytecode format has been redesigned. It's much more
>  orthogonal, partially specialized and tuned for faster
>  dispatch.
>
> - The bytecode interpreter has been completely rewritten in
>  hand-tuned x86 assembler. It has been optimized to reduce
>  branch prediction misses and cache misses. Scheduling has been
>  tuned and instruction-level parallelism has been improved. It
>  features extremely fast dispatch ("indirect threading") and
>  it's very compact. The commonly used parts of the code fit into
>  a single 4K page. Your current-generation out-of-order,
>  speculative-execution CPU monster under your desk will love it
>  (see the dispatch sketch after this list).
>
> - Object references are still tagged values, but now use only
>  half the space (8 bytes). IEEE-754 doubles are overlapped with
>  the other object references: the object tags use a special
>  encoding that lives in otherwise unused FP NaN space (see the
>  second sketch after this list). Amongst other things this means
>  stacks and arrays need 50% less space. D-caches rejoice.
>
> - The explicit frame stack (CallInfo) is gone. The frame
>  structure is implicitly stored in the regular Lua stack (tagged
>  values). This speeds up call handling substantially and also
>  simplifies the compiler (uniform treatment of slots and frames).
>
> - Exception handling has been redesigned along with the new frame
>  structure. A pcall() is almost zero-cost now.
>
> - Coroutines are completely "stackless" again. No C stack needs
>  to be allocated for their use. This is quite a departure from
>  the concept for LuaJIT 1.x -- the only drawback is that you
>  cannot yield *and return* to a C function. But you can of
>  course yield across pcall() and from iterators and metamethods
>  (which is what most people really want or need).
>
> - Many of the standard library functions are now "internal"
>  functions. The fast paths are written in assembler, the
>  fallback paths are written in C. Apart from dramatic speedups
>  this also enables the stackless nature of coroutines.
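>
> To give an idea of the new instruction format and the dispatch
> mechanism, here's a made-up C sketch using GCC's computed goto
> (the real interpreter is hand-written assembler and the actual
> bytecode layout differs):
>
>   /* Fixed-width instructions: an 8 bit opcode, an 8 bit operand A
>      and either one 16 bit operand D or two 8 bit operands B/C. */
>   typedef unsigned int BCIns;
>   #define bc_op(i)  ((i) & 0xff)
>   #define bc_a(i)   (((i) >> 8) & 0xff)
>   #define bc_d(i)   ((i) >> 16)
>
>   enum { BC_MOV, BC_ADD, BC_RET };
>
>   static double run(const BCIns *pc, double *slots)
>   {
>     static const void *disp[] = { &&mov, &&add, &&ret };
>     BCIns i;
>     /* Indirect threading: each opcode ends by fetching the next
>        instruction and jumping through the table. There is no
>        central dispatch loop, so every indirect branch gets its
>        own branch prediction slot. */
>   #define NEXT  i = *pc++; goto *disp[bc_op(i)]
>     NEXT;
>   mov: slots[bc_a(i)] = slots[bc_d(i)]; NEXT;
>   add: slots[bc_a(i)] += slots[bc_d(i)]; NEXT;
>   ret: return slots[bc_a(i)];
>   }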
>
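> And here's the basic trick behind the 8 byte tagged values (again
> only an illustration; the exact encoding differs):
>
>   #include <stdint.h>
>
>   /* A slot holds either a full IEEE-754 double or a 32 bit type
>      tag plus a 32 bit payload. The tags sit at the top of the NaN
>      space: x86 hardware only generates the canonical NaN
>      0xfff8000000000000, so everything above it is free for tags
>      (foreign NaNs must be canonicalized on the way in). */
>   typedef union { double num; uint64_t u64; } TValue;
>
>   #define ITYPE(v)   ((uint32_t)((v).u64 >> 32))
>   #define TAG_TAB    0xfffffffbu        /* payload = GC pointer */
>   #define TAG_NIL    0xffffffffu
>
>   static int tv_is_num(TValue v) { return ITYPE(v) < 0xfff90000u; }
>
>   static TValue tv_tab(uint32_t gcptr)
>   {
>     TValue v; v.u64 = ((uint64_t)TAG_TAB << 32) | gcptr; return v;
>   }
>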
> Lots of other minor things have changed, like the fast string
> hash (see the backport I posted a few weeks ago).
>
> Preliminary benchmarks show that *the interpreter alone* is
> already 2x-4x faster in most cases. Embarrassingly, this is better
> than LuaJIT 1.x in a few benchmarks. Much higher speedups are
> still to be expected from the trace compiler.
>
> But the VM itself is nowhere near complete. Lots of rough edges
> remain. Debug hooks are not yet implemented (I want them to be
> zero-cost when not active). Other debug functionality, like
> tracebacks, already works fine. Many more library functions need
> to be internalized. And a plethora of little details still bear
> prominent TODO signs.
>
> The biggest stumbling block right now is the trace compiler and
> all the associated machinery. I guess I need to redesign trace
> recording (again), because a simple static approach to ucode
> generation doesn't seem to work out well. The x86 backend and
> the register allocator are not there yet, and most optimizations
> are still in their infancy. Consider it vaporware for now.
>
> In short: no realistic ETA, yet -- "2008" is the best promise I'm
> able to give. Before you ask: prereleases are pointless right
> now, because you wouldn't get a fully working VM and I would
> drown in support requests.
>
> On the positive side, the new VM should be easier to port to
> other CPU architectures. But there's more to consider, like a
> different number type (on platforms without hardware FP) or
> different pointer sizes (x64). This will certainly complicate
> things. In any case, a port is not something I want to embark on
> before the x86 version is stable.
>
> My list of things I originally wanted to be in the first release
> has already shrunk considerably. Some things NOT to expect:
>
> - Redesigned memory management: faster allocation and
>  cache-friendly marking. Got many ideas and all the literature
>  I'd ever want to read -- not sure when I'll approach this.
>
> - Concurrency. I've got a neat idea about a specific concurrency
>  model: basically a GIL which is only unlocked for the GC, for
>  blocking on I/O and for CPU-intensive code paths. I.e. the
>  interpreter would run locked, trace-compiled code paths run
>  unlocked or lock on demand. C calls which block or may take
>  lots of CPU time need to unlock explicitly. I don't know
>  whether this works out in practice though (see the sketch
>  after this list).
>
> - A compiler for pattern matches (regexes). I've got the basic
>  design and some code, but it doesn't fit the current execution
>  model anymore. Not that important, so it'll have to wait.
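>
> The locking discipline I have in mind would look roughly like
> this (purely hypothetical, of course):
>
>   #include <pthread.h>
>
>   static pthread_mutex_t gil = PTHREAD_MUTEX_INITIALIZER;
>
>   /* The interpreter always runs with the GIL held. Anything that
>      blocks or burns CPU without touching shared VM state drops
>      the lock around the operation. */
>   static void vm_blocking(void (*op)(void *), void *arg)
>   {
>     pthread_mutex_unlock(&gil);
>     op(arg);                    /* e.g. read(), write(), select() */
>     pthread_mutex_lock(&gil);
>   }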
>
>
> How a trace compiler works
> --------------------------
>
> The LuaJIT 2.x VM operates in four phases:
> 1. Interpretation
> 2. Interpretation + Trace Recording
> 3. Trace Compilation
> 4. Trace Execution
>
> Here's a simplified description of the different phases:
>
> 1. Interpret the bytecode and keep running statistics on loops
> and calls. As soon as certain thresholds are reached, the current
> code path is considered "hot" and the VM proceeds to phase 2 and
> starts recording.
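>
> In code the hot-spot detection amounts to no more than this
> (simplified and with invented names):
>
>   #include <stdint.h>
>
>   #define HOTCOUNT_SIZE  64
>   #define HOTCOUNT_LOOP  56    /* iterations until a loop is hot */
>
>   /* All counters are reset to HOTCOUNT_LOOP at VM startup. */
>   static unsigned short hotcount[HOTCOUNT_SIZE];
>
>   static void start_recording(const uint32_t *pc) { (void)pc; }
>
>   /* Called on every loop backedge from the interpreter. */
>   static void vm_hotloop(const uint32_t *pc)
>   {
>     unsigned short *c =
>       &hotcount[((uintptr_t)pc >> 2) & (HOTCOUNT_SIZE-1)];
>     if (--*c == 0) {           /* this code path just got hot */
>       *c = HOTCOUNT_LOOP;      /* re-arm the counter */
>       start_recording(pc);     /* proceed to phase 2 */
>     }
>   }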
>
> 2. Keep on interpreting the bytecode, but at the same time record
> its actions. This involves not just the bytecodes themselves, but
> also the individual steps taken to execute them (e.g. think about
> a table lookup which might involve metamethods). The relatively
> coarse bytecode is translated into a form of SSA-based microcode
> (ucode) which captures much more detail.
>
> Recording strictly follows the control flow at runtime, which
> effectively flattens only the used parts of the code to a linear
> "trace". Dead code paths are never recorded, just like uncommon
> ("cold") paths. Function calls are treated just like any other
> control flow transition and tracing continues across them (which
> is equivalent to inlining in a traditional compiler).
>
> Conditional branches, type checks, function dispatch and other
> runtime decisions cause "guards" to be inserted into the trace.
> These guarantee an exit from the trace whenever their
> precondition no longer holds in a future execution of the trace.
>
> A really neat consequence is that for a dynamic language (such as
> Lua) a trace also captures a particular runtime specialization of
> operations and types. A trace is like a statically typed variant
> of the code and is well suited to optimization (see below).
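>
> Here's a tiny sketch of what recording a single table read might
> emit (completely made-up IR names; the real ucode carries much
> more detail):
>
>   typedef int IRRef;          /* reference to an earlier SSA instr */
>   enum IROp { IR_GTAB, IR_GNUM, IR_HREF, IR_HLOAD };
>
>   static struct { enum IROp op; IRRef a, b; } trace[1024];
>   static int ntrace;
>
>   static IRRef emit_ir(enum IROp op, IRRef a, IRRef b)
>   {
>     trace[ntrace].op = op; trace[ntrace].a = a; trace[ntrace].b = b;
>     return ntrace++;
>   }
>
>   /* Record "x = t.key", observed at runtime with t being a plain
>      table and the loaded value being a number. Every assumption
>      baked into the trace gets a guard. */
>   static IRRef record_tget(IRRef t, IRRef key)
>   {
>     emit_ir(IR_GTAB, t, 0);                 /* guard: t is a table */
>     IRRef slot = emit_ir(IR_HREF, t, key);  /* hash slot address */
>     IRRef val = emit_ir(IR_HLOAD, slot, 0); /* load the value */
>     emit_ir(IR_GNUM, val, 0);               /* guard: a number */
>     return val;
>   }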
>
> Recording is stopped under certain conditions. Some conditions
> cause an abort, e.g. if an error is raised or some particular
> (hopefully uncommon) combination of events cannot be handled. In
> this case the collected trace is thrown away and the VM proceeds
> again to phase 1 and continues interpreting the code.
>
> Recording is also stopped whenever the trace reaches certain
> points in the control flow. E.g. when the trace start point is
> hit again (next loop iteration). In this case the trace is
> considered complete and the VM proceeds to phase 3 to compile it.
>
> 3. A completed trace is optimized and compiled to machine code.
> Since the trace is just a linear stream of statically typed
> SSA-based ucode, all the traditional compiler optimizations
> apply. I'm only listing the most useful ones here:
>
> - Constant propagation and constant folding replace runtime
>  operations with their constant results. Guards depending on
>  constants can be completely eliminated. Runtime specialization
>  and inlining create many opportunities for this optimization.
>  E.g. the return address from an inlined function is a constant
>  (the call is in the same trace) so the guard can be eliminated.
>
> - CSE (common subexpression elimination) eliminates redundancy in
>  the form of repeated expressions which calculate the same
>  result. Most of these are not caused by the programmer (e.g.
>  the classic (a+b)*(a+b) example) but by the translation to
>  ucode. The most common targets for CSE are repeated guards
>  (e.g. type checks) and address translations for table lookups
>  (a toy version of this pass follows after this list).
>
> - Loop-invariant code hoisting moves operations which depend only
>  on inputs from outside the loop out of the loop. This is most
>  useful to move guards resulting from method resolution and
>  function dispatch out of a loop. E.g. calling math.abs() can be
>  reduced to a single machine code instruction inside the loop
>  plus a check before the start of the loop that math.abs still
>  resolves to the internal function originally handling it.
>
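> Because a trace is one linear SSA stream, even a toy version of
> CSE is tiny: an instruction is redundant iff an identical one was
> already emitted. This reuses emit_ir() from the recording sketch
> above and ignores stores/aliasing for simplicity:
>
>   static IRRef cse_emit(enum IROp op, IRRef a, IRRef b)
>   {
>     IRRef i;
>     for (i = 0; i < ntrace; i++)     /* scan earlier instructions */
>       if (trace[i].op == op && trace[i].a == a && trace[i].b == b)
>         return i;                    /* reuse the previous result */
>     return emit_ir(op, a, b);        /* genuinely new: emit it */
>   }
>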
> Optimization often substantially reduces the number of ucodes in
> a trace. Converting the remaining ucode to machine code is pretty
> straightforward because it's just a linear stream of instructions.
>
> [This is a bit over-simplified: traces can extend existing traces
> to form a trace tree and guards are converted to branches to
> compensation code for returning to the interpreter. There are
> more complications under the hood.]
>
> Once a trace has been compiled, it's registered with its start
> point. Whenever the interpreter hits this part of the code again,
> it proceeds with phase 4 and directly runs the machine code. In
> the case of a loop this is done right after compilation.
>
> 4. The compiled machine code resulting from a trace is run. If
> it's truly a "hot" trace, the compiled code will run often enough
> to outweigh the effort of optimizing and compiling it.
>
> A guard may occasionally cause a trace to exit. This can happen
> when a different control-flow path is taken or a type check or
> bounds check fails and so on. In this case compensation code is
> called to prepare the state for returning to the interpreter.
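>
> The compensation code works off per-guard snapshots, roughly like
> this (invented names once more):
>
>   /* Taken at recording time for every guard: maps the values the
>      compiled trace keeps in machine registers back to the
>      bytecode slots the interpreter expects. */
>   typedef struct {
>     const void *resume_pc;   /* where the interpreter continues */
>     int nmap;
>     struct { int reg, slot; } map[8];
>   } SnapShot;
>
>   static void restore_state(const SnapShot *snap,
>                             const double *regs, double *slots)
>   {
>     int i;
>     for (i = 0; i < snap->nmap; i++)   /* write back live values */
>       slots[snap->map[i].slot] = regs[snap->map[i].reg];
>     /* ...then re-enter the interpreter at snap->resume_pc. */
>   }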
>
> Depending on the kind of guard, the VM either switches to phase 1
> or phase 2. Errors force a return to the interpreter. Most other
> guards start another trace which in turn extends the existing
> trace.
>
> Literature:
>
> Descriptions of trace compilation employed on bytecodes are here:
>
> - Incremental Dynamic Code Generation with Trace Trees
>  A. Gal and M. Franz, 2006
>  http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-06-16.pdf
>
> - HotpathVM: An Effective JIT for Resource-constrained Devices
>  A. Gal, M. Franz, C. W. Probst, 2006
>  http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=4746
>
> The technique originated in system emulators doing on-the-fly
> binary-to-binary translation. A good reference is here:
>
> - Dynamo: A Transparent Dynamic Optimization System
>  V. Bala, E. Duesterwald, S. Banerjia, 2000
>  http://citeseer.ist.psu.edu/bala00dynamo.html
>
> Popular software using this technique includes QEMU and Valgrind.
> Intel's NetBurst architecture (Pentium 4) employs a trace cache
> to store a copy of the x86 instruction stream translated by the
> instruction decoder to RISC-like micro-ops.
>
> --Mike
>