Jani Piitulainen wrote:
> Branchless inner loop is indeed significantly faster, in this case 120%.

And the generated x64 code is very good, too:

394cffc0  mov r14d, r15d
394cffc3  and r14d, +0x07
394cffc7  lea r13d, [r14-0x1]
394cffcb  shr r13d, 0x1f
394cffcf  add ebx, r13d
394cffd2  xor r14d, +0x03
394cffd6  add r14d, -0x01
394cffda  shr r14d, 0x1f
394cffde  sub ebx, r14d
394cffe1  add ebp, ebx
394cffe3  add r15d, +0x01
394cffe7  cmp r15d, 0x05f5e100
394cffee  jle 0x394cffc0	->LOOP
394cfff0  jmp 0x394c001c	->3

The ARM code is pretty cool (a side effect of the -Ofuse optimization, which folds each shift into the operand of the following add/sub):

00367fd4  and   r8, r9, #7
00367fd8  sub   r7, r8, #1
00367fdc  add   r10, r10, r7, lsr #31
00367fe0  eor   r8, r8, #3
00367fe4  sub   r8, r8, #1
00367fe8  sub   r10, r10, r8, lsr #31
00367fec  add   r11, r10, r11
00367ff0  add   r9, r9, #1
00367ff4  cmp   r9, r0
00367ff8  ble   0x00367fd4	->LOOP
00367ffc  bl    0x00360024	->3
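
For readers who don't want to decode the listings by hand, here is a rough C sketch of what both loops appear to compute. It is only a reading of the disassembly, not the original Lua source: the names is_zero, run_branchless and run_branchy are made up, the loop start value is not visible in the traces and is left as a parameter, and the branchy variant is just a guess at what the comparison-based version would look like. The trick is that for 0 <= v < 2^31, (v - 1) only has its top bit set when v == 0, so a logical shift right by 31 turns "v == 0" into 0 or 1 without a branch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if v == 0, else 0, without branching.
   Only valid for v < 2^31: v - 1 wraps to 0xFFFFFFFF when v == 0, so the
   top bit is set, and the logical shift by 31 extracts that bit. */
static uint32_t is_zero(uint32_t v) {
    assert(v < 0x80000000u);
    return (v - 1u) >> 31;
}

/* Branchless loop body, mirroring the listings step by step.
   Assumes 0 <= first <= last; the traces compare i against 100000000. */
static int32_t run_branchless(int32_t first, int32_t last) {
    int32_t x = 0, sum = 0;
    for (int32_t i = first; i <= last; i++) {
        uint32_t m = (uint32_t)i & 7u;   /* and  r14d, 7            */
        x += (int32_t)is_zero(m);        /* lea/shr/add: m == 0     */
        x -= (int32_t)is_zero(m ^ 3u);   /* xor/add/shr/sub: m == 3 */
        sum += x;                        /* add  ebp, ebx           */
    }
    return sum;
}

/* Assumed branchy equivalent, for comparison only. */
static int32_t run_branchy(int32_t first, int32_t last) {
    int32_t x = 0, sum = 0;
    for (int32_t i = first; i <= last; i++) {
        int32_t m = i & 7;
        if (m == 0) x++;
        if (m == 3) x--;
        sum += x;
    }
    return sum;
}
```

Both functions should return the same value for any non-negative range, e.g. run_branchless(1, 8) and run_branchy(1, 8) both give -5. On ARM the ">> 31" costs nothing extra because it rides along as the shifted second operand of the add/sub.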