Very nice ARM code indeed, and similar code should work nicely on ARMv8 as well. I'd love to see a version using predicated (conditionally executed) instructions, although I guess it would save just one instruction. That could still have other performance advantages for pipelining and register dependencies. The code also makes a lot of use of the shifter, if that happens to matter on ARM.

gcc -O3 emitted this for the branching version, which ran in 1.596s (LuaJIT 3.83s) with the same parameters as the earlier Lua version:

  400758:       83 c1 01                add    ecx,0x1
  40075b:       83 c2 01                add    edx,0x1
  40075e:       01 c8                   add    eax,ecx
  400760:       39 d7                   cmp    edi,edx
  400762:       7c 1d                   jl     400781 <test+0x41>
  400764:       89 d6                   mov    esi,edx
  400766:       83 e6 07                and    esi,0x7
  400769:       74 ed                   je     400758 <test+0x18>
  40076b:       83 fe 03                cmp    esi,0x3
  40076e:       40 0f 94 c6             sete   sil
  400772:       83 c2 01                add    edx,0x1
  400775:       40 0f b6 f6             movzx  esi,sil
  400779:       29 f1                   sub    ecx,esi
  40077b:       01 c8                   add    eax,ecx
  40077d:       39 d7                   cmp    edi,edx
  40077f:       7d e3                   jge    400764 <test+0x24>

The generated code for the branchless version is very good indeed: gcc -O3 ran in 2.021s (LuaJIT 1.72s). Also note how similar it is to what LuaJIT generated:

  4007a0:       89 ce                   mov    esi,ecx
  4007a2:       83 c1 01                add    ecx,0x1
  4007a5:       83 e6 07                and    esi,0x7
  4007a8:       44 8d 46 ff             lea    r8d,[rsi-0x1]
  4007ac:       83 f6 03                xor    esi,0x3
  4007af:       83 ee 01                sub    esi,0x1
  4007b2:       41 c1 e8 1f             shr    r8d,0x1f
  4007b6:       c1 ee 1f                shr    esi,0x1f
  4007b9:       44 01 c2                add    edx,r8d
  4007bc:       29 f2                   sub    edx,esi
  4007be:       01 d0                   add    eax,edx
  4007c0:       39 cf                   cmp    edi,ecx
  4007c2:       7d dc                   jge    4007a0 <test2+0x10>
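
For anyone reading along without the source, here is roughly what the two loops compute, reconstructed from the disassembly above. Treat it as a sketch rather than the code that was actually benchmarked: the function names are made up (the real symbols are test and test2), and the start values of i and x are my assumption, since the dumps only show the loop bodies.

```c
#include <stdint.h>

/* Branching form: x goes up when (i & 7) == 0, down when (i & 7) == 3.
 * Hypothetical name; start values i = 1, x = 0 are assumed. */
int32_t sum_branching(int32_t n)
{
    int32_t x = 0, sum = 0;
    for (int32_t i = 1; i <= n; i++) {
        int32_t t = i & 7;
        if (t == 0)
            x++;
        else if (t == 3)
            x--;
        sum += x;
    }
    return sum;
}

/* Branchless form: for t in 0..7, (t - 1) wraps to 0xFFFFFFFF only when
 * t == 0, so its top bit is 1 exactly in that case. XOR with 3 maps
 * t == 3 to 0, so the same trick detects t == 3. */
int32_t sum_branchless(int32_t n)
{
    int32_t x = 0, sum = 0;
    for (int32_t i = 1; i <= n; i++) {
        uint32_t t = (uint32_t)i & 7u;
        x += (int32_t)((t - 1u) >> 31);         /* +1 iff t == 0 */
        x -= (int32_t)(((t ^ 3u) - 1u) >> 31);  /* -1 iff t == 3 */
        sum += x;
    }
    return sum;
}
```

The lea/shr and xor/sub/shr pairs in the gcc listing, and the lsr #31 operands in the ARM listing, correspond to the two shifted terms in sum_branchless.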


> Branchless inner loop is indeed significantly faster, in this case 120%.

And the generated code is very good, too:

394cffc0  mov r14d, r15d
394cffc3  and r14d, +0x07
394cffc7  lea r13d, [r14-0x1]
394cffcb  shr r13d, 0x1f
394cffcf  add ebx, r13d
394cffd2  xor r14d, +0x03
394cffd6  add r14d, -0x01
394cffda  shr r14d, 0x1f
394cffde  sub ebx, r14d
394cffe1  add ebp, ebx
394cffe3  add r15d, +0x01
394cffe7  cmp r15d, 0x05f5e100
394cffee  jle 0x394cffc0        ->LOOP
394cfff0  jmp 0x394c001c        ->3

The ARM code is pretty cool (a side effect of the -Ofuse optimization):

00367fd4  and   r8, r9, #7
00367fd8  sub   r7, r8, #1
00367fdc  add   r10, r10, r7, lsr #31
00367fe0  eor   r8, r8, #3
00367fe4  sub   r8, r8, #1
00367fe8  sub   r10, r10, r8, lsr #31
00367fec  add   r11, r10, r11
00367ff0  add   r9, r9, #1
00367ff4  cmp   r9, r0
00367ff8  ble   0x00367fd4      ->LOOP
00367ffc  bl    0x00360024      ->3