Re: LuaJIT - loop conditional code generation

lua-l archive


Subject: Re: LuaJIT - loop conditional code generation
From: Jani Piitulainen <jani.piitulainen+ll@...>
Date: Thu, 1 Mar 2012 00:00:43 +0200

Very nice ARM code indeed, and similar code should work nicely on ARMv8 as well. I'd love to see it use predicated (conditionally executed) instructions, although I guess that would save just one instruction. Still, that could bring other performance advantages regarding pipelining and register dependencies. It also makes heavy use of the barrel shifter, if that happens to matter on ARM.

gcc -O3 emitted this for the branching version, which ran in 1.596s (LuaJIT: 3.83s) with the same parameters as the earlier Lua version:

400758: 83 c1 01 add ecx,0x1
40075b: 83 c2 01 add edx,0x1
40075e: 01 c8 add eax,ecx
400760: 39 d7 cmp edi,edx
400762: 7c 1d jl 400781 <test+0x41>
400764: 89 d6 mov esi,edx
400766: 83 e6 07 and esi,0x7
400769: 74 ed je 400758 <test+0x18>
40076b: 83 fe 03 cmp esi,0x3
40076e: 40 0f 94 c6 sete sil
400772: 83 c2 01 add edx,0x1
400775: 40 0f b6 f6 movzx esi,sil
400779: 29 f1 sub ecx,esi
40077b: 01 c8 add eax,ecx
40077d: 39 d7 cmp edi,edx
40077f: 7d e3 jge 400764 <test+0x24>
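The C source of `test` isn't shown in this message, but the loop can be read back from the disassembly (edx = i, ecx = x, eax = sum, edi = n). The following is a hypothetical reconstruction: the name `branching_sum`, the start values `i = 1`, `x = 0`, `sum = 0`, and the 64-bit widening of `i` and `sum` (to avoid signed overflow) are all assumptions.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the branching loop in test().
 * Register roles read from the listing: edx = i, ecx = x, eax = sum, edi = n.
 * Start values (i = 1, x = 0, sum = 0) are assumed, not taken from the post. */
int64_t branching_sum(int64_t n)
{
    int32_t x = 0;
    int64_t sum = 0;
    for (int64_t i = 1; i <= n; i++) {      /* cmp edi,edx; jl/jge: loop while i <= n */
        int32_t m = (int32_t)(i & 7);       /* and esi,0x7 */
        if (m == 0)                         /* je: the only real branch, x++ path */
            x++;
        else if (m == 3)                    /* cmp esi,0x3; sete; movzx; sub ecx,esi */
            x--;
        sum += x;                           /* add eax,ecx */
    }
    return sum;
}
```

Note that gcc already turned the `m == 3` case into `sete`/`movzx`/`sub`, so only the `m == 0` test remains a conditional jump. Under the assumed start values, every block of 8 iterations adds -5 to the sum, so `branching_sum(800)` returns -500.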

The generated code for the branchless version is very good indeed: gcc -O3 took 2.021s (LuaJIT: 1.72s). Also note how similar it is to what LuaJIT generated:

4007a0: 89 ce mov esi,ecx
4007a2: 83 c1 01 add ecx,0x1
4007a5: 83 e6 07 and esi,0x7
4007a8: 44 8d 46 ff lea r8d,[rsi-0x1]
4007ac: 83 f6 03 xor esi,0x3
4007af: 83 ee 01 sub esi,0x1
4007b2: 41 c1 e8 1f shr r8d,0x1f
4007b6: c1 ee 1f shr esi,0x1f
4007b9: 44 01 c2 add edx,r8d
4007bc: 29 f2 sub edx,esi
4007be: 01 d0 add eax,edx
4007c0: 39 cf cmp edi,ecx
4007c2: 7d dc jge 4007a0 <test2+0x10>
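As a hypothetical C reconstruction of `test2`, the branchless loop might look like this. The name `branchless_sum`, the start values, and the 64-bit `i`/`sum` are assumptions; the shift form mirrors the disassembly, but whether the original source was written that way isn't shown. Because m = i & 7 lies in 0..7, m - 1 is "negative" (top bit set as unsigned) only when m == 0, and (m ^ 3) - 1 only when m == 3.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the branchless loop in test2().
 * Register roles read from the listing: ecx = i, edx = x, eax = sum, edi = n.
 * Start values (i = 1, x = 0, sum = 0) are assumed. */
int64_t branchless_sum(int64_t n)
{
    int32_t x = 0;
    int64_t sum = 0;
    for (int64_t i = 1; i <= n; i++) {
        uint32_t m = (uint32_t)(i & 7);        /* and esi,0x7 */
        uint32_t is0 = (m - 1u) >> 31;         /* lea r8d,[rsi-0x1]; shr r8d,0x1f: 1 iff m == 0 */
        uint32_t is3 = ((m ^ 3u) - 1u) >> 31;  /* xor esi,0x3; sub esi,0x1; shr esi,0x1f: 1 iff m == 3 */
        x += (int32_t)is0 - (int32_t)is3;      /* add edx,r8d; sub edx,esi */
        sum += x;                              /* add eax,edx */
    }
    return sum;
}
```

Under these assumptions every block of 8 iterations contributes -5, so `branchless_sum(800)` returns -500.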

/Jani

On Wed, Feb 29, 2012 at 9:49 PM, Mike Pall <mikelu-1202@mike.de> wrote:

Jani Piitulainen wrote:
> Branchless inner loop is indeed significantly faster, in this case 120%.

And the generated code is very good, too:

->LOOP:
394cffc0 mov r14d, r15d
394cffc3 and r14d, +0x07
394cffc7 lea r13d, [r14-0x1]
394cffcb shr r13d, 0x1f
394cffcf add ebx, r13d
394cffd2 xor r14d, +0x03
394cffd6 add r14d, -0x01
394cffda shr r14d, 0x1f
394cffde sub ebx, r14d
394cffe1 add ebp, ebx
394cffe3 add r15d, +0x01
394cffe7 cmp r15d, 0x05f5e100
394cffee jle 0x394cffc0 ->LOOP
394cfff0 jmp 0x394c001c ->3
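The two `shr ..., 0x1f` sequences in this trace turn the comparisons m == 0 and m == 3 into 0/1 values without a branch. A small C sketch of just those two predicates follows; the function names are made up for illustration, and the register roles (r14 = i & 7, r13 = the m == 0 bit) are read off the listing. The trick relies on m being small: for 32-bit m above 0x80000000, (m - 1) >> 31 also yields 1.

```c
#include <stdint.h>

/* lea r13d,[r14-0x1]; shr r13d,0x1f  ->  1 iff m == 0 (for m in 0..7) */
static inline uint32_t is_zero_bit(uint32_t m) { return (m - 1u) >> 31; }

/* xor r14d,+0x03; add r14d,-0x01; shr r14d,0x1f  ->  1 iff m == 3 (for m in 0..7) */
static inline uint32_t is_three_bit(uint32_t m) { return ((m ^ 3u) - 1u) >> 31; }

/* Exhaustively checks both identities over every value i & 7 can take.
 * Returns 1 if they hold, 0 otherwise. */
int predicates_hold_for_mask7(void)
{
    for (uint32_t m = 0; m < 8; m++) {
        if (is_zero_bit(m) != (uint32_t)(m == 0u)) return 0;
        if (is_three_bit(m) != (uint32_t)(m == 3u)) return 0;
    }
    return 1;
}
```

For reference, the loop bound 0x05f5e100 in the trace is 100,000,000, and the `jle` makes it inclusive, as in a Lua numeric `for`.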

The ARM code is pretty cool (side effect of -Ofuse optimization):

->LOOP:
00367fd4 and r8, r9, #7
00367fd8 sub r7, r8, #1
00367fdc add r10, r10, r7, lsr #31
00367fe0 eor r8, r8, #3
00367fe4 sub r8, r8, #1
00367fe8 sub r10, r10, r8, lsr #31
00367fec add r11, r10, r11
00367ff0 add r9, r9, #1
00367ff4 cmp r9, r0
00367ff8 ble 0x00367fd4 ->LOOP
00367ffc bl 0x00360024 ->3

--Mike
