[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: LuaJIT performance query
- From: Mike Pall <mikelu-0806@...>
- Date: Sun, 15 Jun 2008 16:38:40 +0200
David Given wrote:
> Is it still necessary to cache globals in local variables for decent
> performance, or can I expect the JIT to optimise this for me?
>
> For example:
>
> i = 0
> while (i < 10000) do i = i + 1 end
>
> vs.
>
> local i = 0
> while (i < 10000) do i = i + 1 end
I wouldn't call this a typical example of caching. Using local
variables for loop counters is just common sense. :-)
But the basic idea is of course the same as in this more typical
example:
local s = 0; for i=1,1000 do s = s + math.sin(i) end
-->
local sin = math.sin; local s = 0; for i=1,1000 do s = s + sin(i) end
LuaJIT 1.x makes a very literal translation of the bytecode to
machine code. So you still need to cache table lookups for optimal
performance of inner loops.
LuaJIT 2.x (no, it's not ready for release) knows how to hoist
table lookups of all kinds out of loops. So the above examples
with math.sin have the same performance.
And to show the general case, the following four loops have
identical performance, too:
local x=1; for i=1,n do x=x+i end -- local
x=1; for i=1,n do x=x+i end -- global (hash lookup)
local t={1}; for i=1,n do t[1]=t[1]+i end -- array lookup
local t={x=1}; for i=1,n do t.x=t.x+i end -- member field (hash lookup)
Of course this just another silly example where a local would be
the most logical choice. But these cases can arise in practice in
OO-centric programs (self.count = self.count + 1) and as a
side-effect of heavy inlining.
Vyacheslav Egorov wrote:
> I do not think that LuaJIT violates Lua semantics. It is not possible
> to determine live range of the given global in the presence of
> functional environments, metatables, dynamic code loading.
It's possible if the complex bytecode instructions are broken up
into smaller operations. The combination of store-to-load
forwarding (in linear control flow and across loop iterations) and
store sinking can remove almost all redundant aggregate access.
This is the same effect that traditional static compilers can
achieve with scalar replacement of aggregates. And since alias
analysis is much easier in Lua (no pointers) even a simplistic
disambiguation algorithm can be far more effective.
Metatables and other dynamisms pose no problem either. A direct or
indirect metatable access is just a special load or store to the
underlying object. If the alias analysis doesn't find a conflicting
store inside the loop (setmetatable to the same object), it can
safely hoist all metatable loads out of the loop. Even if it finds
a store, it can often just forward it to the load and sink the
store. Ditto for function environments.
Dynamic code loading can introduce new closures, but the identity
of all closures is checked anyway. If a check later fails, the
trace will be extended with a side branch to handle the new
closures.
E.g. here's the SSA IR generated by LuaJIT 2.x for the uncached
math.sin loop:
0004 I Knum KNUM 1
0005 I Knum KNUM 1000
0006 >P num SLOAD #2 i
0007 I Kstr KSTR "math"
0008 I Kfun KFUNC 0x8072c88
0009 >I fun SLOAD #0 (current closure)
0010 >I fun EQ 0009 0008
0011 I tab FLOAD 0008 cl_env getfenv()
0012 >I Kstr HREF 0011 0007
0013 >I tab HLOAD 0012 math
0014 I Kstr KSTR "sin"
0015 >I Kstr HREF 0013 0014
0016 >I fun HLOAD 0015 math.sin
0017 I Kfun KFUNC 0x80714a0
0018 >I fun EQ 0016 0017 check for math.sin closure identity
0019 num SIN 0006
0020 >P num SLOAD #1 s
0021 num ADD 0019 0020
0022 num ADD 0006 0004
0023 > num LE 0022 0005
0024 num PHI 0006 0022 i
0025 num PHI 0020 0021 s
1st column: IR instruction number (implicit SSA ref)
2nd column: I: invariant, P: left PHI operand, >: guard.
3rd column: K: constant, num/str/fun/tab/...: typed IR
4th column: IR operation
5th/6th column: IR operands (SSA refs or literals)
As you can see almost all operations are invariant and will be
hoisted out of the loop. Scalar stores are implicit and not
represented in the IR because all of them can be sunk. The only
remaining operations inside the loop are SIN, two ADDs and a LE
comparison. The only guard left inside the loop is the loop exit
condition. This is the minimum number of operations -- a C compiler
couldn't generate any better code.
[ And for the specialists: no, there's no metatable lookup for the
function environment or for the math table because the type checks
implied by the loads suffice. If a table value is not nil, neither
an __index nor a __newindex metamethod can ever be called. ]
--Mike