[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: LuaJIT roadmap 2008
- From: "Mark Meijer" <meijer78@...>
- Date: Fri, 1 Feb 2008 21:50:57 +0100
This sounds absolutely wonderful, thanks for the detailed info!
On 01/02/2008, Mike Pall <firstname.lastname@example.org> wrote:
> There have been several requests to publish a roadmap for LuaJIT,
> so here it is:
> LuaJIT 1.x
> - LuaJIT 1.1.4 will be released in the next few days. This is
> strictly a minor bug fix upgrade to catch up with Lua 5.1.3.
> - CoCo 1.1.4 will be released at the same time, because Lua 5.1.3
> makes some incompatible changes to the C call depth counting
> (basically reverse-patching these parts back to Lua 5.1.2).
> LuaJIT 2.x
> I've been working on LuaJIT 2.x for quite some time now. It took
> much longer than I had expected. I'm now at the third complete
> redesign, because the first two approaches I tried didn't work
> out: 1. shadow SSA info to augment the old bytecode got too
> complex. 2. one-pass on-the-fly SSA generation doesn't work for
> Lua because of an ugly corner-case involving upvalue semantics.
> However I've gotten quite far with the third approach, namely a
> fast interpreter combined with a trace compiler (detailed
> description at the end of this posting).
> The VM has been designed from the ground up to support the new
> execution model. It still shares some components with the
> standard Lua 5.1 interpreter, but many parts have changed.
> Here are the most interesting changes:
> - The bytecode format has been redesigned. It's much more
> orthogonal, partially specialized and tuned for faster
> - The bytecode interpreter has been completely rewritten in
> hand-tuned x86 assembler. It has been optimized to reduce
> branch prediction misses and cache misses. Scheduling has been
> tuned and instruction-level parallelism has been improved. It
> features extremely fast dispatch ("indirect threading") and
> it's very compact. The commonly used parts of the code fit into
> a single 4K page. Your current-generation out-of-order- and
> speculative-execution CPU monster under your desk will love it.
> - Object references are still tagged values, but now only use
> half the space (8 bytes). IEEE-754 doubles are overlapped with
> other object references. The object tags have a special
> encoding which is an unused FP NaN. Amongst other things this
> means stacks and arrays need 50% less space. D-Caches rejoice.
> - The explicit frame stack (CallInfo) is gone. The frame
> structure is implicitly stored in the regular Lua stack (tagged
> values). This speeds up call handling substantially and also
> simplifies the compiler (uniform treatment of slots and frames).
> - Exception handling has been redesigned along with the new frame
> structure. A pcall() is almost zero-cost now.
> - Coroutines are completely "stackless" again. No C stack needs
> to be allocated for their use. This is quite a departure from
> the concept for LuaJIT 1.x -- the only drawback is that you
> cannot yield *and return* to a C function. But you can of
> course yield across pcall() and from iterators and metamethods
> (which is what most people really want or need).
> - Many of the standard library functions are now "internal"
> functions. The fast paths are written in assembler, the
> fallback paths are written in C. Apart from dramatic speedups
> this also enables the stackless nature of coroutines.
> Lots of other minor things have changed, like the fast string
> hash (see the backport I posted a few weeks ago).
> Preliminary benchmarks show that *the interpreter alone* is
> already 2x-4x faster in most cases. Embarassingly this is better
> than LuaJIT 1.x in a few benchmarks. Much higher speedups are
> still to be expected from the trace compiler.
> But the VM itself is nowhere near complete. Lots of rough edges
> remain. Debug hooks are not yet implemented (I want them to be
> zero-cost when not active). Other debug functionality, like
> tracebacks, already work fine. Many more library functions need
> to be internalized. And a plethora of little details still bear
> prominent TODO signs.
> The biggest stumbling block right now is the trace compiler and
> all the associated machinery. I guess I need to redesign trace
> recording (again), because a simple static approach to ucode
> generation doesn't seem to work out well. The x86 backend and the
> register allocator is not there yet, most optimizations are still
> in their infancy. Consider it vaporware for now.
> In short: no realistic ETA, yet -- "2008" is the best promise I'm
> able to give. Before you ask: prereleases are pointless right
> now, because you wouldn't get a fully working VM and I would
> drown in support requests.
> On the positive side, the new VM should be easier to port to
> other CPU architectures. But there's more to consider, like a
> different number type (on platforms without hardware FP) or
> different pointer sizes (x64). This will certainly complicate
> things. In any case, a port is not something I want to embark on
> before the x86 version is stable.
> My list of things I originally wanted to be in the first release
> has already shrunken considerably. Some things NOT to expect:
> - Redesigned memory management: faster allocation and
> cache-friendly marking. Got many ideas and all the literature
> I'd ever want to read -- not sure when I'll approach this.
> - Concurrency. I've got a neat idea about a specific concurrency
> model: basically a GIL which is only unlocked for the GC, for
> blocking on I/O and for CPU-intensive code paths. I.e. the
> interpreter would run locked, trace compiled code paths run
> unlocked or lock on demand. C calls which block or may take
> lots of CPU time need to unlock explicitly. I don't know
> whether this works out in practice though.
> - A compiler for pattern matches (regexes). I've got the basic
> design and some code, but it doesn't fit the current execution
> model anymore. Not that important, so it'll have to wait.
> How a trace compiler works
> The LuaJIT 2.x VM operates in four phases:
> 1. Interpretation
> 2. Interpretation + Trace Recording
> 3. Trace Compilation
> 4. Trace Execution
> Here's a simplified description of the different phases:
> 1. Interpret the bytecode and keep running statistics on loops
> and calls. As soon as certain thresholds are reached, the current
> code path is considered "hot" and the VM proceeds to phase 2 and
> starts recording.
> 2. Keep on interpreting the bytecode, but at the same time record
> its actions. This involves not just the bytecodes themselves, but
> also the individual steps taken to execute them (e.g. think about
> a table lookup which might involve metamethods). The relatively
> coarse bytecode is translated into a form of SSA-based microcode
> (ucode) which captures much more detail.
> Recording strictly follows the control flow at runtime, which
> effectively flattens only the used parts of the code to a linear
> "trace". Dead code paths are never recorded, just like uncommon
> ("cold") paths. Function calls are treated just like any other
> control flow transition and tracing continues across them (which
> is equivalent to inlining in a traditional compiler).
> Conditional branches, type checks, function dispatch and other
> runtime decisions cause "guards" to be inserted into the trace.
> These guarantee an exit of the trace whenever their precondition
> turns out not to be true in a future execution of a trace.
> A really neat consequence is that for a dynamic language (such as
> Lua) a trace also captures a particular runtime specialization of
> operations and types. A trace is like a statically typed variant
> of the code and is well suited to optimization (see below).
> Recording is stopped under certain conditions. Some conditions
> cause an abort, e.g. if an error is raised or some particular
> (hopefully uncommon) combination of events cannot be handled. In
> this case the collected trace is thrown away and the VM proceeds
> again to phase 1 and continues interpreting the code.
> Recording is also stopped whenever the trace reaches certain
> points in the control flow. E.g. when the trace start point is
> hit again (next loop iteration). In this case the trace is
> considered complete and the VM proceeds to phase 3 to compile it.
> 3. A completed trace is optimized and compiled to machine code.
> Since the trace is just a linear stream of statically typed
> SSA-based ucode, all the traditional compiler optimizations
> apply. I'm only listing the most useful ones here:
> - Constant propagation and constant folding replace runtime
> operations with their constant results. Guards depending on
> constants can be completely eliminated. Runtime specialization
> and inlining create many opportunities for this optimization.
> E.g. the return address from an inlined function is a constant
> (the call is in the same trace) so the guard can be eliminated.
> - CSE (common subexpression elimination) eliminates redundancy in
> the form of repeated expressions which calculate the same
> result. Most of these are not caused by the programmer (e.g.
> the classic (a+b)*(a+b) example) but by the translation to
> ucode. The most common targets for CSE are repeated guards
> (e.g. type checks) and address translations for table lookups.
> - Loop-invariant code hosting moves operations out of a loop
> which only depend on inputs outside of the loop. This is most
> useful to move guards resulting from method resolving and
> function dispatch out of a loop. E.g. calling math.abs() can be
> reduced to a single machine code instruction inside the loop
> plus a check before the start of the loop that math.abs still
> resolves to the internal function originally handling it.
> Optimization often substantially reduces the number of ucodes in
> a trace. Converting the remaining ucode to machine code is pretty
> straightforward because it's just a linear stream of instructions.
> [This is a bit over-simplified: traces can extend existing traces
> to form a trace tree and guards are converted to branches to
> compensation code for returning to the interpreter. There are
> more complications under the hood.]
> Once a trace has been compiled, it's registered with its start
> point. Whenever the interpreter hits this part of the code again,
> it proceeds with phase 4 and directly runs the machine code. In
> the case of a loop this is done right after compilation.
> 4. The compiled machine code resulting from a trace is run. If
> it's truly a "hot" trace, the compiled code will run often enough
> to outweigh the effort of optimizing and compiling it.
> A guard may occasionally cause a trace to exit. This can happen
> when a different control-flow path is taken or a type check or
> bounds check fails and so on. In this case compensation code is
> called to prepare the state for returning to the interpreter.
> Depending on the kind of guard, the VM either switches to phase 1
> or phase 2. Errors force a return to the interpreter. Most other
> guards start another trace which in turn extends the existing
> Descriptions of trace compilation employed on bytecodes are here:
> - Incremental Dynamic Code Generation with Trace Trees
> A. Gal and M. Franz, 2006
> - HotpathVM: An Effective JIT for Resource-constrained Devices
> A. Gal, M. Franz, C. W. Probst, 2006
> The technique originated in system emulators doing on-the-fly
> binary-to-binary translation. A good reference is here:
> - Dynamo: A Transparent Dynamic Optimization System
> V. Bala, E. Duesterwald, S. Banerjia, 2000
> Popular software using this technique are QEMU and Valgrind.
> Intel's Netburst architecture (Pentium 4) employs a trace cache
> to store a copy of the x86 instruction stream translated by the
> instruction decoder to RISC-like micro-ops.