[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: [ANN] llvm-lua 1.0
- From: Mike Pall <mikelu-0906@...>
- Date: Wed, 3 Jun 2009 05:26:12 +0200
Robert G. Jakabosky wrote:
> One way llvm-lua could improve the speed of static compiled code would be to
> add support for recording hints during runtime and writting those hints to a
> file that would then be passed to the static compiler.
Sure, feedback-directed optimization (FDO) would probably help a
lot. But static compilers face some inherent difficulties when
compiling a dynamic language:
- Dynamic languages generate tons of code paths, even for simple
operations. A static compiler has to compile *all* code paths.
Apart from the overhead, this pessimizes the fast code paths,
too (esp. PRE and register allocation). A JIT compiler only
generates code for paths that are actually executed (this may
also differ between runs).
- Even with FDO, static compilers have to be very conservative
about inlining and specialization to limit code growth. A JIT
compiler can be more aggressive, because it has to deal with a
much lower number of code paths.
- Not even the best C compilers are able to hoist complex control
structures out of loops. Once you lower (say) a load from a hash
table (a loop following the hash chain) to the LLVM IR, you lose
the original semantics. This usually means you cannot optimize
it or eliminate it, either.
That's not to discourage you from your work -- just to point out
some difficulties you'll be facing with llvm-lua.
The following trivial Lua function is a nice optimization challenge:
function f(x, n)
for i=1,n do x = math.abs(x) end
The inner loop has two hash table loads (_G["math"], math["abs"]),
a function dispatch and a call to a standard library function.
Here's the corresponding checklist for a Lua compiler in
increasing order of complexity:
yes yes Type-specializes the hash table loads (object and key type).
yes* yes Inlines both hash table loads (* partially inlined by LJ1).
no yes Specializes the loads to the hash slot (omit chain loop).
yes yes Specializes to a monomorphic function dispatch for abs().
yes yes Inlines the abs() function.
no yes Hoists both hash table loads out of the loop.
no yes Hoists the abs() out of the loop (abs(abs(x)) ==> abs(x)).
no yes Emits an empty loop (could be optimized away, too).
Now try to get 8x yes with a static compiler, while preserving the
full Lua semantics (e.g. somebody might do math.abs = math.sin or
_G might have a metatable etc.). Good luck! :-)
> Below are the results of running scimark-2008-01-22.lua  for a performance
> # taskset -c 1 /usr/bin/lua scimark-2008-01-22.lua large
The large dataset is an out-of-cache problem and cache thrashing
(esp. for FFT) dominates the timings. IMHO the small dataset is
better suited to compare the performance of the different VMs.
Anyway, for the large dataset LuaJIT 1.1.x gets 89.80 and Lua 5.1.4
gets 13.87 on my box. And here's a preliminary result for LJ2:
FFT 85.91 
SOR 821.95 
SPARSE 324.05 [100000, 1000000]
LU 820.51 
SciMark 425.11 [large problem sizes]
(i.e. 31x faster than plain Lua)
For comparison, GCC 4.3.2 -m32 -march=core2 -O3 -fomit-frame-pointer:
FFT 93.00 
SOR 883.53 
SPARSE 716.08 [100000, 1000000]
LU 1149.43 
SciMark 604.44 [large problem sizes]
(I've reformatted the C program output to match the Lua output.)
Ok, so LJ2 is getting closer to C's performance. It's already on
par with C for simpler benchmarks (like mandelbrot or spectralnorm).
But there still remains a lot to do ...