lua-l archive


Alex Bradbury wrote:
> So at its core, it's just a loop which fetches each instruction,
> decodes it and then does a switch on the opcode number. The
> performance numbers I've managed so far aren't competitive with the C
> implementation and I'm interested in suggestions for improving that.

The main loop of an interpreter is a tough job for compilers (and
CPUs). Most interpreters have been written in C, so C compilers
have been tuned for this case over the years. They still generate
mediocre code, compared to what you can achieve in assembler. But
writing an interpreter in almost any other language will generate
really, really slow code.
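For concreteness, the kind of main loop being discussed might look
like this in Lua. This is only a sketch, not Alex's code; the
two-opcode ISA and its byte-packed encoding are invented for the
example:

```lua
-- Minimal fetch/decode/dispatch loop for an invented two-opcode ISA:
-- opcode 1 = ADD rd, ra, rb (one number per instruction: 8-bit
-- opcode, then three 4-bit register fields), opcode 2 = HALT.
local function run(code, regs)
  local pc = 1
  while true do
    local insn = code[pc]                      -- fetch
    local op = insn % 256                      -- decode the opcode
    if op == 1 then                            -- ADD
      local rd = math.floor(insn / 256) % 16
      local ra = math.floor(insn / 4096) % 16
      local rb = math.floor(insn / 65536) % 16
      regs[rd] = regs[ra] + regs[rb]
      pc = pc + 1
    elseif op == 2 then                        -- HALT
      return regs
    else
      error("bad opcode " .. op)
    end
  end
end

-- ADD r1, r2, r3 encodes as 1 + 1*256 + 2*4096 + 3*65536 = 205057:
local regs = run({ 205057, 2 }, { 0, 5, 7 })   -- regs[1] becomes 12
```

The if/elseif chain stands in for the switch on the opcode number.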

> Looking at luajit -jv and luajit -jdump, it's clear that the root
> trace is always aborted.

Always? I'd expect it to trace one iteration for a particular
instruction. This would be the start of a dispatch tree with side
traces for all other instructions.

Maybe something else is preventing the root trace from finishing?
I have a couple of toy interpreters in Lua and they generate the
expected traces.

> It is possible I'm doing something stupid in
> my implementation, but intuitively it makes sense that a trace
> compiler would struggle with a program with this structure.

Well, it won't create stellar code, but due to the statistical
probing, the dispatch tree will be approximately sorted by
instruction frequency. This works fine for a small number of
instructions, but fails once the tree gets too big.

Another problem is that it duplicates a lot of code in each side
trace and cannot hoist it back into the root trace. A C compiler
can do that in some cases. But C doesn't need to rely so heavily
on hoisting, since everything is statically typed.

> Any
> suggestions on restructuring it for better optimisation by LuaJIT? I'd
> imagine dynamic translation to generated Lua code would get some good
> results, as it would allow the compiler to build traces which cover
> sequences of code in the simulated program. However, that's not a
> direction I'm going to take right now.

Yes, that would perform much better (expect 20x at least). You'll
need to expand the operands into constants, too (cpu.regs[7] =
cpu.regs[5] etc.). This allows the compiler to disambiguate the
registers and forward them across instructions.
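As a sketch of that operand expansion (the translate_add helper and
the cpu.regs naming are just illustrative assumptions): instead of
executing regs[rd] = regs[ra] + regs[rb] with run-time indices, the
translator emits Lua source with the indices baked in as constants:

```lua
-- Hypothetical translator for a single ADD instruction: the register
-- numbers become literal constants in the generated Lua source, so
-- the compiler can tell cpu.regs[7] and cpu.regs[5] apart and
-- forward values across the generated instructions.
local function translate_add(rd, ra, rb)
  return string.format("cpu.regs[%d] = cpu.regs[%d] + cpu.regs[%d]",
                       rd, ra, rb)
end

print(translate_add(7, 5, 6))
-- cpu.regs[7] = cpu.regs[5] + cpu.regs[6]
```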

You can do this as a JIT compiler; it's not that hard:

Create an initially empty table holding Lua functions for each
branch target. Begin with a branch to the start address. An
__index metamethod then translates a piece of the original code
that extends until the next branch and stores the resulting Lua
function in the table.

Each function ends with a tail call to the next target(s). This
calls an existing function or triggers a new compilation. LuaJIT
recognizes tail recursion and turns it into loops, so this ought
to perform well.
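Condensed into runnable form, the scheme might look like the sketch
below. All the concrete details (the record-style toy ISA, the cpu
table, the names) are invented; only the structure follows the
description: a lazily filled block table, an __index metamethod that
translates up to the next branch, and tail calls between compiled
blocks.

```lua
local loadstring = loadstring or load  -- Lua 5.1/LuaJIT vs. 5.2+

-- Invented toy program: regs[1] = regs[2] + regs[3], jump, halt.
local code = {
  { op = "add", rd = 1, ra = 2, rb = 3 },
  { op = "jmp", target = 3 },
  { op = "halt" },
}

local blocks  -- compiled Lua functions, keyed by branch target

blocks = setmetatable({}, {
  __index = function(t, addr)
    -- Translate from addr until the next branch.
    local src, pc = {}, addr
    while true do
      local insn = code[pc]
      if insn.op == "add" then        -- operands expanded to constants
        src[#src + 1] = string.format(
          "cpu.regs[%d] = cpu.regs[%d] + cpu.regs[%d]",
          insn.rd, insn.ra, insn.rb)
        pc = pc + 1
      elseif insn.op == "jmp" then    -- a branch ends the basic block:
        src[#src + 1] = string.format("return blocks[%d](cpu)",
                                      insn.target)
        break
      else                            -- halt
        src[#src + 1] = "return cpu"
        break
      end
    end
    local f = assert(loadstring(
      "local blocks = ...\nreturn function(cpu)\n" ..
      table.concat(src, "\n") .. "\nend"))(blocks)
    t[addr] = f                       -- cache the compiled block
    return f
  end,
})

local cpu = { regs = { 0, 5, 7 } }
blocks[1](cpu)                        -- begin with a branch to the start
-- cpu.regs[1] is now 12
```

The tail call in the generated "jmp" code is what lets LuaJIT turn
the chain of compiled blocks back into a loop.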

Ok, so that's a simple basic block compiler. Possible variations
are extended basic blocks or even traces. :-)

> Also (to Mike), might it be worth adding a new FAQ entry for
> for a question like "My code doesn't run as fast as I think it should
> in LuaJIT. Why not?".

Well, one could write a book about this. But I have neither the
time nor the motivation for this (I'm good at code, not docs).

Maybe I'll point to -jv/-jdump. Alas, the output might be
confusing for most users (e.g. some amount of trace aborts is
perfectly normal; that's the statistical probing at work).