- Subject: Re: Benchmark shootout shows LuaJIT 2.0 (was Re: [ANN] LuaJIT-2.0.0-beta1)
- From: Mike Pall <mikelu-0911@...>
- Date: Mon, 2 Nov 2009 13:16:49 +0100
Bulat Ziganshin wrote:
> > Only the hand-vectorized stuff in C and C++ is faster. Guess I
> > need to add auto-vectorization. Well, maybe next week ... ;-)
> with Larrabee support, please :)
I prefer to wait until Intel gets their act together, the silicon
performs well, and there's an actual market for it. You know, in
the past I had the displeasure of porting some stuff to Itanium,
and I think it never went live ...
That said, trace compilers are especially well suited to
non-monolithic architectures (like GPGPU, Larrabee or Cell).
Here's the basic idea:
A general-purpose CPU or a dedicated controller runs the
interpreter and collects traces. When a cluster of sufficiently
hot traces has been generated, they are analyzed for suitability
to the special-purpose processing units (e.g. high number of FPU
ops and not too branchy). If so, the trace cluster is recompiled
for the SPU architecture (heterogeneous archs pose no problem).
After distribution to the SPUs it (hopefully) runs much faster.
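The analysis step above can be sketched in a few lines of C. This is
only an illustration of the idea, not LuaJIT code: the opcode names,
struct layout and thresholds are all made up for the example.

```c
#include <stddef.h>

enum ir_op { IR_FADD, IR_FMUL, IR_LOAD, IR_STORE, IR_BRANCH, IR_GUARD };

struct trace {
  const enum ir_op *ir;  /* linear IR of the recorded trace */
  size_t nins;           /* number of instructions */
};

/* Offload candidate: at least half of the instructions are FP ops
   and at most 1 in 8 is a branch or a side-exit guard. */
static int spu_suitable(const struct trace *t)
{
  size_t fp = 0, br = 0, i;
  for (i = 0; i < t->nins; i++) {
    switch (t->ir[i]) {
    case IR_FADD: case IR_FMUL: fp++; break;
    case IR_BRANCH: case IR_GUARD: br++; break;
    default: break;
    }
  }
  return t->nins > 0 && 2 * fp >= t->nins && 8 * br <= t->nins;
}
```

Because a trace is just a linear instruction sequence, this pass is a
single cheap scan, which is why the decision can be made online while
the cluster is still hot.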
It helps that traces are much more linear than regular code and
represent only a minimal hot subset of the code. FYI: A very
unscientific survey showed that one might get away with a trace
cache of 8K-16K in many cases. This would also put tightly-coupled
memory (TCM) to good use in cache-challenged CPUs.
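To make the size estimate concrete: a trace cache in that range is
small enough to be a single statically sized buffer, exactly the kind
of thing one would pin into TCM. A hypothetical bump allocator over a
16K buffer (sizes and behavior are assumptions, not how LuaJIT manages
its mcode area):

```c
#include <stddef.h>
#include <string.h>

#define TCACHE_SIZE (16 * 1024)

static unsigned char tcache[TCACHE_SIZE];
static size_t tcache_top;

/* Copy a compiled trace into the cache; return its address, or NULL
   when full. A real VM would flush the whole cache and re-record
   instead of failing. */
static void *tcache_alloc(const void *mcode, size_t len)
{
  void *p;
  if (len > TCACHE_SIZE - tcache_top)
    return NULL;  /* cache full */
  p = &tcache[tcache_top];
  memcpy(p, mcode, len);
  tcache_top += len;
  return p;
}
```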
Hyperblock scheduling should increase the opportunities for SPU
execution, because it replaces branches with predicated execution.
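A toy illustration of that if-conversion, with the branchy form next
to a predicated form that computes a mask and blends both operands the
way a SIMD select would (purely illustrative, not generated code):

```c
#include <stdint.h>
#include <string.h>

/* Branchy: clamp x to a lower bound using control flow. In a trace
   this "if" becomes a guard, i.e. a potential side exit. */
static float clamp_branchy(float x, float lo)
{
  if (x < lo)
    x = lo;
  return x;
}

/* Predicated select: r = pred ? a : b as a branchless mask blend. */
static float fsel(int pred, float a, float b)
{
  uint32_t ua, ub, m, ur;
  float r;
  memcpy(&ua, &a, sizeof ua);
  memcpy(&ub, &b, sizeof ub);
  m = 0u - (uint32_t)(pred != 0);  /* all-ones or all-zeros mask */
  ur = (ua & m) | (ub & ~m);
  memcpy(&r, &ur, sizeof r);
  return r;
}

static float clamp_predicated(float x, float lo)
{
  return fsel(x < lo, lo, x);
}
```

Both sides of the blend are always evaluated, so the transformation
pays off exactly where the hyperblock is already going to an SPU that
hates divergent control flow more than a little extra arithmetic.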
To unlock the full potential of most architectures, one would need
some support for concurrent execution in the VM, too.
[And no, these are just some nice research ideas and not a goal
for LuaJIT anytime soon.]