David Given wrote:
> Mike Pall wrote:
> > bor/band/bxor    129.5 ns    67.0 ns    13.7 ns     0.0 ns
> 
> That's impressive performance for LuaJIT 2; does that 0.0ns also include
> the overhead needed to call out to a Lua extension function in C, or is
> there some streamlined mechanism to emit the appropriate instructions
> inline with the generated code? (Or, more prosaically, does your
> benchmark not count the C overhead?)

There is no C overhead. The LuaJIT 2.x interpreter just dispatches
to an internal "fast function", written in assembler. That dispatch
plus the argument setup accounts for the 13.7 ns (mainly the dispatch
overhead for 3 bytecodes).

The trace compiler basically ignores all control flow, including
function calls. It uses a natural-loop-first (NLF) region-selection
algorithm and then extends traces from the exits of the loops it
has formed. Loops are also pre-rolled to enable hoisting of
loop-invariant code (e.g. the check for the specialization to the
called function).
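
As a rough source-level analogy for the pre-rolling described above
(a sketch only: the trace compiler works on its own intermediate
representation, not on Lua source, and the names `ops`, `expected`,
`sum_checked` and `sum_peeled` are made up for illustration), peeling
the first iteration lets the specialization check run once instead
of on every pass:

```lua
-- Hypothetical stand-in for the bit library: bor1(i) behaves like
-- bit.bor(i, 1), i.e. it sets the lowest bit.
local ops = { bor1 = function(i) return i % 2 == 0 and i + 1 or i end }
local expected = ops.bor1   -- the function the loop is specialized to

-- Naive form: the specialization check runs on every iteration.
local function sum_checked(n)
  local x = 0
  for i = 1, n do
    assert(ops.bor1 == expected, "specialization check failed")
    x = x + expected(i)
  end
  return x
end

-- Peeled form: the first iteration carries the check. Nothing in the
-- loop body changes ops.bor1, so the remaining iterations skip it.
local function sum_peeled(n)
  local x = 0
  if n >= 1 then
    assert(ops.bor1 == expected, "specialization check failed")
    x = x + expected(1)
  end
  for i = 2, n do
    x = x + expected(i)
  end
  return x
end

print(sum_checked(10), sum_peeled(10))  -- both forms give the same sum
```

Both functions compute the same result; the peeled one just pays for
the check once.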

It's able to compile this:

  local x=0; for i=1,1e9 do x=x+bit.bor(i,1) end

into this machine code (only inner loop shown):

  [...]
->loop:
  mov edi, esi
  or edi, +0x01
  cvtsi2sd xmm6, edi
  addsd xmm7, xmm6
  add esi, +0x01
  cmp esi, 0x3b9aca00
  jle ->loop
  jmp ->EXIT_3

Note that the reduction variable needs to be a double in this case
(the sum is larger than an int32 can hold). So the bottleneck is the
addsd dependency chain, with a latency of 3 cycles per instruction.
This is where the basic loop overhead comes from (1 ns = 3 cycles at
3 GHz). The remaining instructions have plenty of execution bandwidth
left and thus do not add to the measured time, hence the 0.0 ns.
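
The two numeric claims above can be checked with a few lines of
arithmetic. This is a minimal sketch in plain Lua; `bor1`,
`sum_closed` and `sum_loop` are illustrative helpers, and the 3-cycle
latency and 3 GHz clock are simply the figures quoted in the text:

```lua
-- bit.bor(i, 1) sets the low bit: odd i stays i, even i becomes i + 1.
local function bor1(i) return i % 2 == 0 and i + 1 or i end

-- Closed form of bor1(1) + ... + bor1(n) for even n: the plain sum
-- n(n+1)/2 plus one extra for each of the n/2 even values.
local function sum_closed(n) return n * (n + 1) / 2 + n / 2 end

-- Brute-force check of the closed form on a small even n.
local function sum_loop(n)
  local x = 0
  for i = 1, n do x = x + bor1(i) end
  return x
end
assert(sum_loop(1000) == sum_closed(1000))

-- The full sum is about 5e17, far beyond the int32 range.
local INT32_MAX = 2^31 - 1
local total = sum_closed(1e9)
print(total > INT32_MAX)

-- One addsd per iteration on the critical path: cycles / GHz = ns.
local latency_cycles, clock_ghz = 3, 3
local ns_per_iter = latency_cycles / clock_ghz
print(ns_per_iter)  -- nanoseconds per iteration, i.e. ~1 s per 1e9
```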

OK, so this is not a useful microbenchmark for measuring the
overhead of individual machine code instructions (*). But the
intention was to show a (coarse) relative comparison of the cost of
bit operations across Lua implementations.

(*) Like most other integer instructions, "or reg, imm" has 1 uop and
    1 cycle latency on a Core 2.
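
For anyone who wants to try a similar coarse comparison, here is a
minimal timing sketch. It is not the harness behind the numbers
quoted above; it assumes the figures are per-operation times with
the loop overhead subtracted, which this post does not spell out.
`time_it`, `loop_bor`, `loop_base` and `ns_per_op` are made-up names,
and the arithmetic fallback only mimics bit.bor(i, 1) so the script
still runs where no `bit` module is installed:

```lua
-- Use the bit library if present (LuaJIT, or Lua 5.1 with Lua BitOp).
-- Otherwise fall back to a stand-in that only handles bor(i, 1).
local ok, bit = pcall(require, "bit")
local bor = ok and bit.bor
  or function(i, _) return i % 2 == 0 and i + 1 or i end

-- CPU time in seconds taken by f(n), measured with os.clock.
local function time_it(f, n)
  local t0 = os.clock()
  f(n)
  return os.clock() - t0
end

-- Loop under test: same shape as the one-liner in the post.
local function loop_bor(n)
  local x = 0
  for i = 1, n do x = x + bor(i, 1) end
  return x
end

-- Baseline: the same loop without the bit operation.
local function loop_base(n)
  local x = 0
  for i = 1, n do x = x + i end
  return x
end

-- Per-operation cost in ns with the baseline subtracted. Timer noise
-- can make this slightly negative when the true cost is near zero.
local function ns_per_op(n)
  local dt = time_it(loop_bor, n) - time_it(loop_base, n)
  return dt / n * 1e9
end

print(string.format("bor: %.1f ns/op", ns_per_op(1e6)))
```

Results from a script like this are only meaningful relative to each
other on the same machine, which matches the coarse comparison
intended here.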

--Mike