lua-l archive

Gavin Wraith wrote:
> Lots of CPU instruction sets
> have LOADQ, ADDQ, SUBQ instructions that load/add/subtract
> small constants encoded as part of the instruction
> itself. I was slightly surprised to see that the Lua 5
> VM instruction set did not have any, but that all
> constants were fetched from a constants pool.

The tradeoffs for instruction design are different for CPUs
vs. VMs. And the way modern CPUs work has implications for
VM design, too. Sure there are similarities, but the design
constraints are not easily comparable.

> Although this makes for conceptual simplicity it means that
> all constant loading involves memory fetches. On a
> CPU architecture like the ARM, and I guess on any
> RISC design, efficiency means minimising memory fetches.

Yes, the speed mismatch between CPU execution units and main
memory has widened in recent years. But don't forget about
the caches sitting in between.

The main design goal is to reduce fetches from high-latency
memory. A fetch from the L1 cache is comparatively cheap, and
the cost of the fetch almost disappears since the fetch units
and the execution units work in parallel.

In this case the constant pool is very likely to be in the
L1 cache (it is stored consecutively, too). So you basically
trade a small reduction in fetch bandwidth for an increase
in I-cache usage and an additional int/double conversion.
I'm not sure this would pay off.
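To make that trade-off concrete, here is a minimal sketch (not the actual Lua source; the instruction layout and function names are invented for illustration) of the two decoding strategies: fetching a double from a constant pool versus decoding a small signed immediate from the instruction word and converting it.

```c
#include <assert.h>

/* Hypothetical 32-bit instruction layout: 8-bit opcode, 8-bit
   destination register, 16-bit operand (either a constant-pool
   index or a signed immediate, depending on the opcode). */
typedef unsigned int Instruction;

#define GET_OP(i)   ((i) & 0xff)
#define GET_A(i)    (((i) >> 8) & 0xff)
#define GET_Bx(i)   (((i) >> 16) & 0xffff)
#define GET_sBx(i)  ((int)GET_Bx(i) - 0x8000)  /* excess-K signed immediate */

/* LOADK-style: one extra load, but no conversion -- the pool
   already holds doubles, and it is likely to sit in the L1 cache. */
static double exec_loadk(Instruction i, const double *kpool) {
  return kpool[GET_Bx(i)];
}

/* Hypothetical LOADQ: no pool access, but an int->double
   conversion on every execution. */
static double exec_loadq(Instruction i) {
  return (double)GET_sBx(i);
}
```

The immediate form saves the pool load but pays for the conversion and for the extra opcodes in the VM loop.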

And you gotta be really careful not to swamp the I-cache.
E.g. the Python VM is known to have this problem. Seemingly
innocuous changes to the core VM loop have caused huge
performance losses because the I-cache started thrashing.
Changing compiler optimization flags has shown similar effects.

Sure, the Lua VM won't hit this wall anytime soon, mainly
because the core VM loop and other frequently needed code
(like the hashtable stuff) are almost an order of magnitude
smaller. But add a few library calls and you'll soon get closer
to the I-cache limits (sprintf or strtod can _really_ ruin
your day).

[If you want to do your own measurements: Have a look at
cachegrind (part of valgrind) and kcachegrind (part of kdesdk).]

While we are talking about caches: stack VMs may benefit from
caching the TOS (top of stack) in a register, but register VMs
(such as Lua 5's) won't. The hardware caches work just fine.
Basically, register VMs are better suited for modern CPUs
(better use of instruction-level parallelism is another plus).
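A toy example of what TOS caching means, in case it isn't obvious (this is an invented mini-VM, not Lua code): the top-of-stack value lives in a C local that the compiler can keep in a CPU register, so the common "pop two, push one" pattern touches memory only once.

```c
#include <assert.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

/* Toy stack VM with a software-cached TOS.  A register VM gets
   the equivalent effect for free: its operands are plain slots in
   an array that the hardware caches handle well on their own. */
static double run(const int *code, const double *args) {
  double stack[64], tos = 0.0;
  int sp = 0;  /* number of values below the cached TOS */
  for (;;) {
    switch (*code++) {
      case OP_PUSH:
        stack[sp++] = tos;    /* spill the old TOS to memory */
        tos = args[*code++];  /* load the new one */
        break;
      case OP_ADD:
        tos += stack[--sp];   /* only the second operand hits memory */
        break;
      case OP_HALT:
        return tos;
    }
  }
}
```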

> There seems to be plenty of space for Q-versions of
> some of the Lua VM opcodes, and it would not be hard
> to extend the parser to recognize the small number literals
> that should invoke them. Has anybody done this, or is it
> reckoned that the effort is not worth the likely gains?

You're welcome to try this. It shouldn't be too difficult.
But note that a good deal of the constant fetches come from
the three-operand arithmetic instructions. You may need quite
a few additional instructions to cover all interesting cases.
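For context, here is a sketch of how Lua 5.1 encodes these mixed register-or-constant (RK) operands: the top bit of a 9-bit field selects between a register and a constant-pool slot. The `BITRK`/`ISK`/`INDEXK` macros mirror the ones in lopcodes.h; the `rk_value` helper is mine, for illustration only.

```c
#include <assert.h>

/* Lua 5.1-style RK operand decoding (cf. lopcodes.h): a 9-bit
   operand whose top bit says "this is a constant index".  The
   resulting data-dependent branch is exactly what the RKB/RKC
   macros introduce into ADD, SUB, etc. */
#define BITRK       (1 << 8)        /* top bit of a 9-bit operand */
#define ISK(x)      ((x) & BITRK)   /* is it a constant index? */
#define INDEXK(x)   ((x) & ~BITRK)  /* strip the flag bit */

/* Resolve an RK operand against the register file and the
   constant pool (illustrative helper, not an actual Lua macro). */
static double rk_value(int rk, const double *regs, const double *k) {
  return ISK(rk) ? k[INDEXK(rk)] : regs[rk];
}
```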

And of course you have to pick your benchmarks carefully. Looking
through my Lua sources, I find hardly any use of numeric constants
in the timing-critical inner loops. Most values are held in local
variables (i.e. they generate (cached) references to the Lua stack).
Well ... I doubt this optimization would make much of a difference,
at least for my code. YMMV.

I think there are better opportunities to optimize the Lua VM:

- Reduce unpredictable branches. There are only a few left
  in the critical paths: the main VM loop switch and some
  instructions combining variable and constant operands
  (RKB/RKC). A few others can be found in ldo.c (luaD_precall)
  and the hashtable code (lvm.c and ltable.c).

- Reduce instruction decoding overhead: predecoding during
  loading or semi-JIT (translate the decoding step only) may
  be interesting areas to explore.

- Shorten the critical code paths and move more stuff into
  CPU registers (esp. for register-starved CPUs like x86):
  move the hookmask checks into a shadow loop, optimize the
  instruction dispatch (no bounds checks, direct threading),
  inline Lua/C function call code and more of the hashtable code.

- Write a true JIT with dynamic type inferencing, massive
  inlining and all tricks you can find. Ok, this is a major
  undertaking (and I'm not sure it would pay off).
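Two of these ideas (predecoding and direct-threaded dispatch) can be sketched together. This is an illustration, not Lua code, and it relies on the GCC/Clang "labels as values" extension; the toy opcodes are invented.

```c
#include <assert.h>

/* Direct-threaded dispatch: a predecoding pass replaces each
   opcode with its handler's address ("translate the decoding
   step only"), and each handler then jumps straight to the next
   one -- no central switch, no bounds check, and the indirect
   branches are spread out, which helps the branch predictors. */
static int run(const int *code, int len) {
  void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
  void *threaded[32];  /* assumes len <= 32 for this sketch */
  int i, acc = 0;
  void **pc;
  for (i = 0; i < len; i++)  /* predecode: opcode -> address */
    threaded[i] = handlers[code[i]];
  pc = threaded;
  goto **pc++;
op_inc:  acc++; goto **pc++;
op_dec:  acc--; goto **pc++;
op_halt: return acc;
}
```

The portable fallback is a plain switch loop, so an implementation would normally hide the choice behind a macro.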

Don't forget about optimizing VM code generation, too. There
was a thread two weeks ago that gave some hints (search for
"Optimizing off-line Lua compiler" in the archives).

But one final caveat: it seems most people using Lua have
already moved the timing critical parts of their code to C/C++
functions. Optimizing the Lua VM won't buy you that much then.
Always benchmark first, then take the low-hanging fruit and
repeat until you are satisfied with the result.