• Subject: Re: LuaJIT - loop conditional code generation
• From: Jani Piitulainen <jani.piitulainen+ll@...>
• Date: Wed, 29 Feb 2012 21:28:33 +0200

Branchless inner loop is indeed significantly faster, in this case 120%.

Following code outputs this on my system (Ubuntu 11.10 x64, 2.4GHz i7, git head luajit)

Branching version took 3.83 seconds
Branchless took 1.72 seconds

On vanilla lua 5.1.4 same code has the opposite effect: branching version is a bit over 100% faster (306.46s vs 634.12s). Can't call that unexpected, though. It'd of course be faster if I removed tobit() calls.

--

local bit = require("bit")
local band=bit.band
local rshift = bit.rshift
local tobit = bit.tobit
local bxor = bit.bxor

function test()
local a=0
local c=1

for i=1, 100000000, 1 do
if band(i, 7)==0 then
c = tobit(c + 1)
end
if band(i, 7)==3 then
c = tobit(c - 1)
end
a = tobit(a + c)
end
return a,c
end

function test2()
local a=0
local c=1

for i=1, 100000000, 1 do
c = tobit(c + rshift(band(i, 7)-1, 31))
c = tobit(c - rshift(bxor(band(i, 7), 3)-1, 31))
a = tobit(a + c)
end
return a,c
end

test()
start = os.clock()
for j=1, 10, 1 do
test()
end
print("Branching version took "..os.clock()-start.." seconds")

test2()
start = os.clock()
for j=1, 10, 1 do
test2()
end

/Jani

On Wed, Feb 29, 2012 at 8:20 PM, Mike Pall wrote:
Jani Piitulainen wrote:
> Is there a way to stop simple conditionals in a loop from jumping out of a
> compiled trace?

No.

> "bit.band" with one argument is used here to tell LuaJIT result is a 32-bit
> integer, it's compiled to nothing. It can sometimes be useful reducing
> unnecessary double <-> int conversions (like cvtsi2sd & cvtsd2si).

That's what bit.tobit() is for. But the effect would be the same.

>   for i=1, 1000000, 1 do
>     if band(i, 7)==0 then
>       c = band(c + 1)
>     end
>     if band(i, 7)==3 then
>       c = band(c - 1)
>     end
>     a = band(a + c)
>   end
>
> Using LuaJIT git head x64. The generated code is very similar for x86.
>
> ebx = i, ebp = a, eax = c
>
> ->LOOP:
> 394cffd0  mov r15d, ebx
> 394cffd3  and r15d, +0x07
> 394cffd7  jz 0x394c0028 ->6 -- this jumps out of the JITted trace, right?
> why not jnz to "cmp r15d, +0x03" here + add eax, 1?
> 394cffdd  cmp r15d, +0x03
> 394cffe1  jz 0x394c002c ->7
> 394cffec  cmp ebx, 0x000f4240
> 394cfff2  jle 0x394cffd0        ->LOOP

This is just the first trace. Look at the other traces, they
connect to this trace and build up the full control flow graph of
all hot paths incrementally. The branches in the parent traces are

Admittedly the generated code isn't optimal. Trace compilers have
some difficulties generating optimal code for control flow with
unbiased branches. Hyperblock scheduling ought to solve this, but
that's not something I can implement anytime soon.

Or consider using branch-free code: that's quite easy to do with a
few bit operation tricks. One idea is to use bit.arshift(x, 31)
which creates a mask of all 0s or 1s, depending on the sign of x.
select between results y and z. This can be further reduced in
many cases.

There are some other tricks, e.g. this ...

c = band(c + rshift(band(i, 7)-1, 31))

... can replace your first conditional.

>   local a=band(0)
>   local c=band(1)

You don't need to do that. The JIT compiler always uses the
smallest possible representation for constants. I.e. 0 and 1 are
treated as integers and converted to FP constants as needed.
There's no cost involved with constant conversions, since they
happen at compile time.

>     -- changing conditions into following form produces effectively identical x64 code
>     -- c = band(c + (band(i, 7)==0 and 1 or 0))
>     -- c = band(c - (band(i, 7)==3 and 1 or 0))

Not surprising, since 'and' and 'or' are turned into control flow
at the bytecode level, i.e. the same as if/then/else/elseif.

--Mike

• Follow-Ups: