- Subject: Re: Possible Bug in bitlib under Windows?
- From: Mike Pall <mikelu-0812@...>
- Date: Sat, 13 Dec 2008 20:08:28 +0100
David Given wrote:
> Mike Pall wrote:
> > bor/band/bxor 129.5 ns 67.0 ns 13.7 ns 0.0 ns
>
> That's impressive performance for LuaJIT 2; does that 0.0ns also include
> the overhead needed to call out to a Lua extension function in C, or is
> there some streamlined mechanism to emit the appropriate instructions
> inline with the generated code? (Or, more prosaically, does your
> benchmark not count the C overhead?)
There is no C overhead. The LuaJIT 2.x interpreter just dispatches
to an internal "fast function", written in assembler. That dispatch
plus the argument setup account for the 13.7 ns (mainly the dispatch
overhead for 3 bytecodes).
The trace compiler basically ignores all control flow, including
function calls. It uses a natural-loop-first (NLF) region-selection
algorithm and then extends traces from the exits of the loops it
has formed. Loops are also pre-rolled to enable hoisting of
loop-invariant code (e.g. the check for the specialization to the
called function).
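A source-level picture of what pre-rolling buys, as a rough sketch only.
This is not LuaJIT's actual IR transformation. The names lib, bor1,
sum_checked and sum_prerolled are hypothetical, and bor1 is a pure-Lua
stand-in for bit.bor(i, 1) so that the sketch runs without the bit library.
The point is just that peeling off the first iteration lets the "is this
still the function we specialized to?" check run once instead of on every
pass through the loop.

```lua
-- Stand-in for bit.bor(i, 1); valid for non-negative integers only.
local function bor1(i) return i % 2 == 0 and i + 1 or i end

-- Hypothetical table playing the role of the global `bit` table.
local lib = { bor1 = bor1 }

-- Naive form: the specialization check is repeated on every iteration.
local function sum_checked(n)
  local x = 0
  for i = 1, n do
    local f = lib.bor1
    if f ~= bor1 then error("guard failed") end  -- per-iteration guard
    x = x + f(i)
  end
  return x
end

-- Pre-rolled form: iteration 1 is peeled off and carries the guard;
-- the loop proper then runs without it.
local function sum_prerolled(n)
  local x = 0
  if n < 1 then return x end
  local f = lib.bor1
  if f ~= bor1 then error("guard failed") end  -- guard, executed once
  x = x + f(1)                                  -- peeled first iteration
  for i = 2, n do
    x = x + f(i)                                -- guard-free loop body
  end
  return x
end

print(sum_checked(10), sum_prerolled(10))
```

Both functions return the same sum. Only the placement of the check
differs.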
It's able to compile this:
  local x=0; for i=1,1e9 do x=x+bit.bor(i,1) end
into this machine code (only inner loop shown):
    [...]
  ->loop:
    mov edi, esi
    or edi, +0x01
    cvtsi2sd xmm6, edi
    addsd xmm7, xmm6
    add esi, +0x01
    cmp esi, 0x3b9aca00
    jle ->loop
    jmp ->EXIT_3
Note that the reduction variable needs to be a double in this case
(the sum is larger than an int32 can hold). So the bottleneck is the
addsd dependency chain with a latency of 3 cycles per instruction.
This is where the basic loop overhead comes from (1 ns = 3 cycles at
3 GHz). The remaining opcodes have plenty of execution bandwidth left
and thus do not contribute to the final result.
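The two numbers in the paragraph above can be checked with a few lines of
arithmetic. This is a back-of-the-envelope sketch, not a measurement. The
closed form for the sum is an added derivation (odd i contribute i, even i
contribute i + 1), and the variable names are made up for illustration.
The 3-cycle latency, the 3 GHz clock and the 1e9 iteration count are the
figures quoted above.

```lua
local n = 1e9                      -- loop trip count from the example

-- Sum of bit.bor(i, 1) for i = 1..n (n even):
-- sum(i) plus one extra for each of the n/2 even values of i.
local sum = n * (n + 1) / 2 + n / 2
local int32_max = 2^31 - 1

print(sum > int32_max)             -- true: the accumulator cannot be an int32

-- Loop-carried addsd dependency chain: one addsd per iteration,
-- each waiting on the previous result.
local latency_cycles = 3           -- addsd latency quoted above
local clock_ghz = 3                -- 3 GHz clock, i.e. 3 cycles per ns
local ns_per_iter = latency_cycles / clock_ghz
local total_s = n * ns_per_iter / 1e9

print(ns_per_iter, total_s)
```

So the whole loop should take about one second on such a machine, with
the addsd chain setting the floor no matter how cheap the "or" is.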
Ok, so this is not a useful microbenchmark for measuring the
overhead of individual machine code instructions (*). But the
intention was to show a (coarse) relative comparison of the cost of
bit operations across Lua implementations.
(*) Like most other integer instructions, "or reg, imm" is 1 uop with
1 cycle latency on a Core 2.
--Mike