lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Robert G. Jakabosky wrote:
> One way llvm-lua could improve the speed of static compiled code would be to 
> add support for recording hints during runtime and writting those hints to a 
> file that would then be passed to the static compiler.

Sure, feedback-directed optimization (FDO) would probably help a
lot. But static compilers face some inherent difficulties when
compiling a dynamic language:

- Dynamic languages generate tons of code paths, even for simple
  operations. A static compiler has to compile *all* code paths.
  Apart from the overhead, this pessimizes the fast code paths,
  too (esp. PRE and register allocation). A JIT compiler only
  generates code for paths that are actually executed (this may
  also differ between runs).

- Even with FDO, static compilers have to be very conservative
  about inlining and specialization to limit code growth. A JIT
  compiler can be more aggressive, because it has to deal with a
  much lower number of code paths.

- Not even the best C compilers are able to hoist complex control
  structures out of loops. Once you lower (say) a load from a hash
  table (a loop following the hash chain) to the LLVM IR, you lose
  the original semantics. This usually means you cannot optimize
  it or eliminate it, either.

That's not to discourage you from your work -- just to point out
some difficulties you'll be facing with llvm-lua.

The following trivial Lua function is a nice optimization challenge:

  function f(x, n)
    for i=1,n do x = math.abs(x) end
    return x

The inner loop has two hash table loads (_G["math"], math["abs"]),
a function dispatch and a call to a standard library function.
Here's the corresponding checklist for a Lua compiler in
increasing order of complexity:

LJ1  LJ2
yes  yes  Type-specializes the hash table loads (object and key type).
yes* yes  Inlines both hash table loads (* partially inlined by LJ1).
no   yes  Specializes the loads to the hash slot (omit chain loop).
yes  yes  Specializes to a monomorphic function dispatch for abs().
yes  yes  Inlines the abs() function.
no   yes  Hoists both hash table loads out of the loop.
no   yes  Hoists the abs() out of the loop (abs(abs(x)) ==> abs(x)).
no   yes  Emits an empty loop (could be optimized away, too).

Now try to get 8x yes with a static compiler, while preserving the
full Lua semantics (e.g. somebody might do math.abs = math.sin or
_G might have a metatable etc.). Good luck! :-)

> Below are the results of running scimark-2008-01-22.lua [1] for a performance 
> comparison.
> # taskset -c 1 /usr/bin/lua scimark-2008-01-22.lua large

The large dataset is an out-of-cache problem and cache thrashing
(esp. for FFT) dominates the timings. IMHO the small dataset is
better suited to compare the performance of the different VMs.

Anyway, for the large dataset LuaJIT 1.1.x gets 89.80 and Lua 5.1.4
gets 13.87 on my box. And here's a preliminary result for LJ2:

FFT        85.91  [1048576]
SOR       821.95  [1000]
MC         73.14  
SPARSE    324.05  [100000, 1000000]
LU        820.51  [1000]

SciMark   425.11  [large problem sizes]

(i.e. 31x faster than plain Lua)

For comparison, GCC 4.3.2 -m32 -march=core2 -O3 -fomit-frame-pointer:

FFT        93.00  [1048576]
SOR       883.53  [1000]
MC        180.16  
SPARSE    716.08  [100000, 1000000]
LU       1149.43  [1000]

SciMark   604.44  [large problem sizes]

(I've reformatted the C program output to match the Lua output.)

Ok, so LJ2 is getting closer to C's performance. It's already on
par with C for simpler benchmarks (like mandelbrot or spectralnorm).
But there still remains a lot to do ...