- Subject: Re: New LuaJIT benchmark results (was Re: [ANN] LuaJIT-2.0.0-beta3)
- From: Mike Pall <mikelu-1003@...>
- Date: Wed, 10 Mar 2010 23:09:06 +0100
Geoff Leyland wrote:
> Modifying the flip loop in fannkuch from:
> for flips=1,1000000 do
> to:
> local flips = 1
> while true do
> saves a few percent on fannkuch - I guess because we drop the
> check against a million?
Yes, this helps a bit. But more importantly the for loop was
narrowed to integers and this wasn't really helpful here. It
caused a subsequent widening for the maxflips comparison.
OTOH the while loop is not narrowed and the higher latencies for
the floating-point ops don't matter here.
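For readers without the benchmark source at hand, here is a minimal sketch of the two loop shapes being compared. It is not the shootout submission: the function names `reverse_prefix`, `count_flips_for` and `count_flips_while` are made up for illustration, and the counter starts at 0 rather than 1 to keep the return value simple. Both variants compute the same flip count; only the loop construct differs, which is what influences whether the counter gets narrowed.

```lua
-- Hypothetical helper: reverse the first k entries of array q in place.
local function reverse_prefix(q, k)
  local i, j = 1, k
  while i < j do
    q[i], q[j] = q[j], q[i]
    i, j = i + 1, j - 1
  end
end

-- Variant A: bounded numeric for loop. The loop counter is a
-- candidate for narrowing to an integer by the JIT compiler.
local function count_flips_for(p)
  local q = {}
  for i = 1, #p do q[i] = p[i] end      -- work on a copy
  for flips = 1, 1000000 do
    local k = q[1]
    if k == 1 then return flips - 1 end -- flips done before this pass
    reverse_prefix(q, k)
  end
  error("flip bound exceeded")
end

-- Variant B: unbounded while loop with a manually maintained counter.
-- There is no upper-bound check, and the counter is an ordinary number.
local function count_flips_while(p)
  local q = {}
  for i = 1, #p do q[i] = p[i] end      -- work on a copy
  local flips = 0
  while true do
    local k = q[1]
    if k == 1 then return flips end
    reverse_prefix(q, k)
    flips = flips + 1
  end
end

print(count_flips_for({4, 2, 1, 5, 3}), count_flips_while({4, 2, 1, 5, 3}))
```

Whether variant B is actually faster depends on the surrounding code and the traces the compiler selects, so treat this only as an illustration of the source-level change.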
> I also made a modification to the last loop of the permute
> that's less clear cut.
That loop contributes only 1.2 percent to the total runtime. But
for some reason your changes lead to a better region selection.
As I said: fannkuch looks really simple, but it's a worst case for
a trace compiler.
BTW: A much faster approach to fannkuch would flip and rotate 4
bit chunks of a 64 bit register. That would go up to N=16, which
takes hours right now. But it's no longer 'indexed access' and
Remember that fannkuch is run with -Ohotloop=1 on the shootout,
since this leads to better traces. It's still a win with N=12.
I can pretty directly scale the timings on my system to the
shootout system. That would give us 88s on both 32 and 64 bit for
fannkuch (old: 93.23s and 90.57s). The ratio should go up to 1.81
and 1.78. And nbody's ratio ought to improve to 1.75 and 1.77
after your changes.
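As a rough check on the fannkuch timings quoted above, the following sketch computes the implied speedup from the old and projected times. The helper names `speedup` and `percent_saved` are hypothetical, and 88s is the rounded projection from the text. The shootout ratios (1.81, 1.78 and so on) depend on baseline times that are not given here, so they are not reproduced.

```lua
-- Times in seconds on the shootout system, as quoted in the post.
local old_time = { ["32 bit"] = 93.23, ["64 bit"] = 90.57 }
local new_time = 88  -- projected (rounded), same for both

-- Hypothetical helpers for the comparison.
local function speedup(old, new)
  return old / new
end

local function percent_saved(old, new)
  return (1 - new / old) * 100
end

for _, arch in ipairs({ "32 bit", "64 bit" }) do
  local old = old_time[arch]
  print(string.format("%s: %.3fx faster, %.1f%% less time",
    arch, speedup(old, new_time), percent_saved(old, new_time)))
end
```

This works out to roughly 5.6 percent less time on 32 bit and 2.8 percent on 64 bit, in line with the "few percent" mentioned at the start of the thread.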
So fannkuch stays the median and LuaJIT would then overtake Java
for the median score on both 32 and 64 bit. Yay!
Good job, you should tune more of my benchmark submissions. :-)
[Yes, I know the median score is not the whole story. The upper
percentiles are still too high compared to Java or C++. I need to
improve these the most -- even if it won't affect the score.]