gcc -O3 emitted this for the branching version, which ran in 1.596s (LuaJIT: 3.83s) with the same parameters as the earlier Lua version:
40075b: 83 c2 01 add edx,0x1
40075e: 01 c8 add eax,ecx
400760: 39 d7 cmp edi,edx
400762: 7c 1d jl 400781 <test+0x41>
400764: 89 d6 mov esi,edx
400766: 83 e6 07 and esi,0x7
400769: 74 ed je 400758 <test+0x18>
40076b: 83 fe 03 cmp esi,0x3
40076e: 40 0f 94 c6 sete sil
400772: 83 c2 01 add edx,0x1
400775: 40 0f b6 f6 movzx esi,sil
400779: 29 f1 sub ecx,esi
40077b: 01 c8 add eax,ecx
40077d: 39 d7 cmp edi,edx
40077f: 7d e3 jge 400764 <test+0x24>
The generated code for the branchless version is very good indeed: gcc -O3 took 2.021s (LuaJIT: 1.72s). Also note how similar it is to what LuaJIT generated:
4007a0: 89 ce mov esi,ecx
4007a2: 83 c1 01 add ecx,0x1
4007a5: 83 e6 07 and esi,0x7
4007a8: 44 8d 46 ff lea r8d,[rsi-0x1]
4007ac: 83 f6 03 xor esi,0x3
4007af: 83 ee 01 sub esi,0x1
4007b2: 41 c1 e8 1f shr r8d,0x1f
4007b6: c1 ee 1f shr esi,0x1f
4007b9: 44 01 c2 add edx,r8d
4007bc: 29 f2 sub edx,esi
4007be: 01 d0 add eax,edx
4007c0: 39 cf cmp edi,ecx
4007c2: 7d dc jge 4007a0 <test2+0x10>
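For anyone who would rather not decode the listings, here is a rough C sketch of what the two loops appear to compute, reconstructed from the disassembly above. The C source itself is not shown in this message, so the function names, the int32_t types and the starting index of 1 are my assumptions. Only the loop bodies are read off the generated code.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the branching loop:
 * x goes up when (i & 7) == 0 and down when (i & 7) == 3. */
int32_t test_branching(int32_t n)
{
    int32_t x = 0, sum = 0;
    for (int32_t i = 1; i <= n; i++) {   /* start index is assumed */
        int32_t k = i & 7;
        if (k == 0)
            x++;
        else if (k == 3)
            x--;
        sum += x;
    }
    return sum;
}

/* Hypothetical reconstruction of the branchless loop.
 * For k in 0..7, (k - 1) wraps to 0xFFFFFFFF only when k == 0, so its
 * top bit is the result of the test "k == 0". XOR with 3 maps k == 3
 * to 0, so the same trick then tests "k == 3". */
int32_t test_branchless(int32_t n)
{
    int32_t x = 0, sum = 0;
    for (int32_t i = 1; i <= n; i++) {   /* start index is assumed */
        uint32_t k = (uint32_t)i & 7u;
        x += (int32_t)((k - 1u) >> 31);          /* +1 when k == 0 */
        x -= (int32_t)(((k ^ 3u) - 1u) >> 31);   /* -1 when k == 3 */
        sum += x;
    }
    return sum;
}
```

The two unsigned shifts correspond to the `lea`/`shr` and `xor`/`sub`/`shr` pairs in the listing. Both functions should return the same value for any n that does not overflow the accumulator.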
On Wed, Feb 29, 2012 at 9:49 PM, Mike Pall
<mikelu-1202@mike.de> wrote:
Jani Piitulainen wrote:
> Branchless inner loop is indeed significantly faster, in this case 120%.
And the generated code is very good, too:
->LOOP:
394cffc0 mov r14d, r15d
394cffc3 and r14d, +0x07
394cffc7 lea r13d, [r14-0x1]
394cffcb shr r13d, 0x1f
394cffcf add ebx, r13d
394cffd2 xor r14d, +0x03
394cffd6 add r14d, -0x01
394cffda shr r14d, 0x1f
394cffde sub ebx, r14d
394cffe1 add ebp, ebx
394cffe3 add r15d, +0x01
394cffe7 cmp r15d, 0x05f5e100
394cffee jle 0x394cffc0 ->LOOP
394cfff0 jmp 0x394c001c ->3
The ARM code is pretty cool (side effect of -Ofuse optimization):
->LOOP:
00367fd4 and r8, r9, #7
00367fd8 sub r7, r8, #1
00367fdc add r10, r10, r7, lsr #31
00367fe0 eor r8, r8, #3
00367fe4 sub r8, r8, #1
00367fe8 sub r10, r10, r8, lsr #31
00367fec add r11, r10, r11
00367ff0 add r9, r9, #1
00367ff4 cmp r9, r0
00367ff8 ble 0x00367fd4 ->LOOP
00367ffc bl 0x00360024 ->3
--Mike