- Subject: Re: Implementation of Lua and direct/context threaded code
- From: Mike Pall <mikelu-0606@...>
- Date: Thu, 1 Jun 2006 03:10:35 +0200
Grellier, Thierry wrote:
> I've attached a patch, that can allow using some kind of direct threaded
> technique to lua 5.1 (on lvm.c) for anything but i386 with gcc compiler
> (selection is done in luaconf.h and can be improved I guess, notably I
> made assumption for powerpc... but at least it shall preserve
> portability). On i386 it keeps switch/case.
Umm, this doesn't work. The "const Instruction i = ..." appears
in every block as a new local variable. Ok, easy fix: remove the
"const" from the main declaration and the "const Instruction"
redeclaration in every block.
> A quick test let me think it allows to gain up to 5%-10% on
> sparcs using some benches of
> http://shootout.alioth.debian.org/, but performs worst on i386
> (because of replicated code in BREAK I guess, and less
> registers). So if anybody wants to play with it and see how it
> performs on their system...
Well, I've looked into the x86 assembler output: GCC is smart
enough to recognize the identical code sequences and merges all
of them into a single instance. This means you still get a single
indirect branch with no advantages for branch prediction.
So you absolutely must compile lvm.c (and only this file) with
-fno-crossjumping or you won't see any effect, no matter how hard
you try.
Still ... the generated code is not faster and has gotten huge
(I-cache bloat). It's better to change the hook check to:
  if (L->hookmask & (LUA_MASKLINE | LUA_MASKCOUNT)) goto activehook;
... and add the code for the uncommon execution path at the end
(might be beneficial for plain Lua, too).
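
To make the two points above concrete, here is a minimal sketch of a
toy interpreter, not the actual lvm.c patch. The opcodes, the VM
struct, the DISPATCH() macro and the MASKLINE/MASKCOUNT constants are
all hypothetical stand-ins. It relies on the GCC "labels as values"
extension (also accepted by Clang). The dispatch sequence is replicated
at the end of every handler, the instruction variable is declared once
and only assigned in each block, and the hook handling sits on a cold
path at the end of the function:

```c
/* Hypothetical toy opcodes; this is not Lua's instruction set. */
enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

/* Stand-ins for LUA_MASKLINE / LUA_MASKCOUNT. */
#define MASKLINE  1
#define MASKCOUNT 2

typedef struct {
  int hookmask;    /* nonzero bits enable the (rare) hook path */
  long hookcalls;  /* how many times the cold path was taken */
} VM;

/* Fetch the next instruction and jump straight to its handler.
   This sequence is copied into every handler, so each copy ends in
   its own indirect branch. The hook test is a single compare plus a
   jump to out-of-line code. */
#define DISPATCH() \
  do { \
    i = *pc++; \
    if (vm->hookmask & (MASKLINE | MASKCOUNT)) goto activehook; \
    goto *handlers[i]; \
  } while (0)

long run(VM *vm, const unsigned char *code, long acc) {
  /* Table of handler addresses (GCC extension: &&label). */
  static void *const handlers[] = {
    &&do_inc, &&do_dec, &&do_double, &&do_halt
  };
  const unsigned char *pc = code;
  unsigned char i;  /* declared once; assigned, never redeclared */

  DISPATCH();

do_inc:    acc += 1; DISPATCH();
do_dec:    acc -= 1; DISPATCH();
do_double: acc *= 2; DISPATCH();
do_halt:   return acc;

activehook:  /* uncommon path, kept out of the hot handlers */
  vm->hookcalls++;
  goto *handlers[i];
}
```

Running the program { OP_INC, OP_INC, OP_DOUBLE, OP_DEC, OP_HALT } with
acc = 0 returns 3 whether or not a hook bit is set. Keep in mind that
without -fno-crossjumping GCC may merge the replicated dispatch
sequences back into one.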
The register allocator seems to have a hard time with the main
loop on a register-starved x86. Even with -fomit-frame-pointer.
And it makes some unfortunate decisions, too (like spilling ra
before the branch). Moving the StkId ra = RA(i) into every block
helps a bit. Anyway, the generated code is still messy and spills
far too many registers.
I benchmarked this on a PIII and a P4 with GCC 3.3/3.4 and -O3
-fomit-frame-pointer (plus -fno-crossjumping for lvm.c). The
numbers given are the speedup (+) or reduction (-) in percent
against stock Lua 5.1 (compiled with the same options):

Benchmark     PIII    P4
binarytrees    +10    -6
cheapconcw     +11   +12
fannkuch         0    +5
knucleotide     +4    -3
mandelbrot      +7    -2
nbody           +6    -6
nsieve          +7   -24
nsievebits     +12    -5
pidigits       +17    +1
recursive      +15   +11
regexdna        +2    -5
revcomp         +7    -8
spectralnorm    +3   -12
sumfile         +6   -11

(All other benchmarks are +-0 because the bottleneck is elsewhere.)
Not really convincing. Especially on the P4, which has deeper
pipelines and should benefit a lot more from the fewer branch
mispredictions. But I guess its I-cache suffers badly from the
code explosion. Well ... it was worth a try.
> Regarding LuaJIT, and referring to article I mentioned (more
> details here:
> http://www.cs.toronto.edu/~bv/tcl2005/tcl2005-slides.pdf). Does
> LuaJIT use similar techniques to reduce misprediction and/or
> inline code in branches?
LuaJIT compiles to machine code, i.e. inlining all opcodes. So it
has zero dispatch overhead by definition. And it's doing a lot
more optimizations, too (like specialization or inlining library
functions). You can have a look at the assembler output with:
  luajit -O -j dump somefile.lua
Just in case you want to compare the above numbers with LuaJIT:
E.g. LuaJIT is 6.74 times faster for mandelbrot, while the patch
above gets you around 1.07 (PIII) or 0.98 (P4) compared to plain
Lua (which is the reference with 1.00).