lua-users home
lua-l archive



Grellier, Thierry wrote:
> I've attached a patch that allows using a kind of direct-threading
> technique in Lua 5.1 (in lvm.c) for anything but i386 with the GCC
> compiler (selection is done in luaconf.h and can be improved, I guess;
> notably I made assumptions for PowerPC... but at least it should
> preserve portability). On i386 it keeps the switch/case.

Umm, this doesn't work. The "const Instruction i = ..." appears
in every block as a new local variable. Ok, easy fix: remove the
"const" from the main declaration and the "const Instruction"
redeclaration in every block.

> A quick test leads me to think it gains up to 5%-10% on
> SPARCs using some benches of
>, but performs worse on i386
> (because of the replicated code in BREAK, I guess, and fewer
> registers). So if anybody wants to play with it and see how it
> performs on their system...

Well, I've looked into the x86 assembler output: GCC is smart
enough to recognize the identical code sequences and merges all
of them into a single instance. This means you still get a single
indirect branch with no advantages for branch prediction.

So you absolutely must compile lvm.c (and only this file) with
-fno-crossjumping or you won't see any effect, no matter how hard
you try.

Still ... the generated code is not faster and has gotten huge
(I-cache bloat). It's better to change the hook check to:

  if (L->hookmask & (LUA_MASKLINE | LUA_MASKCOUNT)) goto activehook; 

... and add the code for the uncommon execution path at the end
(might be beneficial for plain Lua, too).

The register allocator seems to have a hard time with the main
loop on a register-starved x86. Even with -fomit-frame-pointer.
And it makes some unfortunate decisions, too (like spilling ra
before the branch). Moving the StkId ra = RA(i) into every block
helps a bit. Anyway, the generated code is still messy and spills
far too many registers.

I benchmarked this on a PIII and a P4 with GCC 3.3/3.4 and -O3
-fomit-frame-pointer (plus -fno-crossjumping for lvm.c). The
numbers given are the speedup (+) or reduction (-) in percent
against stock Lua 5.1 (compiled with the same options):

Benchmark      PIII   P4
binarytrees    +10    -6    
cheapconcw     +11   +12    
fannkuch         0    +5    
knucleotide     +4    -3    
mandelbrot      +7    -2    
nbody           +6    -6    
nsieve          +7   -24    
nsievebits     +12    -5    
pidigits       +17    +1    
recursive      +15   +11    
regexdna        +2    -5    
revcomp         +7    -8    
spectralnorm    +3   -12    
sumfile         +6   -11    

(All other benchmarks are +-0 because the bottleneck is elsewhere.)

Not really convincing. Especially on the P4, which has deeper
pipelines and should benefit a lot more from the fewer branch
mispredictions. But I guess its I-cache suffers badly from the
code explosion. Well ... it was worth a try.

> Regarding LuaJIT, and referring to the article I mentioned (more
> details here: Does
> LuaJIT use similar techniques to reduce misprediction and/or
> inline code in branches?

LuaJIT compiles to machine code, i.e. it inlines all opcodes. So it
has zero dispatch overhead by definition. And it does a lot more
optimizations, too (like specialization or inlining library
functions). You can have a look at the assembler output with:
  luajit -O -j dump somefile.lua

Just in case you want to compare the above numbers with LuaJIT:
e.g. LuaJIT is 6.74 times faster for mandelbrot, while the patch
above gets you around 1.07 (PIII) or 0.98 (P4) compared to plain
Lua (which is the reference at 1.00).