Re: LuaJIT - loop conditional code generation

local bit = require("bit")

local band=bit.band

local rshift = bit.rshift

local tobit = bit.tobit

local bxor = bit.bxor

function test()

local a=0

local c=1

for i=1, 100000000, 1 do

if band(i, 7)==0 then

c = tobit(c + 1)

end

if band(i, 7)==3 then

c = tobit(c - 1)

end

a = tobit(a + c)

end

return a,c

end

function test2()

local a=0

local c=1

for i=1, 100000000, 1 do

c = tobit(c + rshift(band(i, 7)-1, 31))

c = tobit(c - rshift(bxor(band(i, 7), 3)-1, 31))

a = tobit(a + c)

end

return a,c

end

test()

start = os.clock()

for j=1, 10, 1 do

test()

end

print("Branching version took "..os.clock()-start.." seconds")

test2()

start = os.clock()

for j=1, 10, 1 do

test2()

end

/Jani

On Wed, Feb 29, 2012 at 8:20 PM, Mike Pall <mikelu-1202@mike.de> wrote:

Jani Piitulainen wrote:
> Is there a way to stop simple conditionals in a loop from jumping out of a
> compiled trace?

No.

> "bit.band" with one argument is used here to tell LuaJIT result is a 32-bit
> integer, it's compiled to nothing. It can sometimes be useful reducing
> unnecessary double <-> int conversions (like cvtsi2sd & cvtsd2si).

That's what bit.tobit() is for. But the effect would be the same.

> for i=1, 1000000, 1 do
> if band(i, 7)==0 then
> c = band(c + 1)
> end
> if band(i, 7)==3 then
> c = band(c - 1)
> end
> a = band(a + c)
> end
>
> Using LuaJIT git head x64. The generated code is very similar for x86.
>
> ebx = i, ebp = a, eax = c
>
> ->LOOP:
> 394cffd0 mov r15d, ebx
> 394cffd3 and r15d, +0x07
> 394cffd7 jz 0x394c0028 ->6 -- this jumps out of the JITted trace, right?
> why not jnz to "cmp r15d, +0x03" here + add eax, 1?
> 394cffdd cmp r15d, +0x03
> 394cffe1 jz 0x394c002c ->7
> 394cffe7 add ebp, eax
> 394cffe9 add ebx, +0x01
> 394cffec cmp ebx, 0x000f4240
> 394cfff2 jle 0x394cffd0 ->LOOP

This is just the first trace. Look at the other traces, they
connect to this trace and build up the full control flow graph of
all hot paths incrementally. The branches in the parent traces are
patched to jump to the child traces.

Admittedly the generated code isn't optimal. Trace compilers have
some difficulties generating optimal code for control flow with
unbiased branches. Hyperblock scheduling ought to solve this, but
that's not something I can implement anytime soon.

Or consider using branch-free code: that's quite easy to do with a
few bit operation tricks. One idea is to use bit.arshift(x, 31)
which creates a mask of all 0s or 1s, depending on the sign of x.
This can be used with bor(band(y, mask), band(z, bnot(mask))) to
select between results y and z. This can be further reduced in
many cases.

There are some other tricks, e.g. this ...

c = band(c + rshift(band(i, 7)-1, 31))

... can replace your first conditional.

> local a=band(0)
> local c=band(1)

You don't need to do that. The JIT compiler always uses the
smallest possible representation for constants. I.e. 0 and 1 are
treated as integers and converted to FP constants as needed.
There's no cost involved with constant conversions, since they
happen at compile time.

> -- changing conditions into following form produces effectively identical x64 code
> -- c = band(c + (band(i, 7)==0 and 1 or 0))
> -- c = band(c - (band(i, 7)==3 and 1 or 0))

Not surprising, since 'and' and 'or' are turned into control flow
at the bytecode level, i.e. the same as if/then/else/elseif.

--Mike