lua-users home
lua-l archive


OK, that makes sense.

Of all the things on the roadmap, the FFI is the feature that I'd like to see implemented as soon as 2.0 is out of beta. IMO, the Python ctypes package is a good example to follow.
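For context, here is a minimal sketch of the ctypes style I have in mind: declare a C function's signature, then call it directly, with no hand-written binding code. (This assumes a POSIX system, where passing None to CDLL loads the process's own C library; the function names here are ordinary libc, nothing LuaJIT-specific.)

```python
import ctypes

# Load the C library of the running process (POSIX-specific shortcut).
libc = ctypes.CDLL(None)

# Declare the signature of a libc function...
libc.labs.restype = ctypes.c_long
libc.labs.argtypes = [ctypes.c_long]

# ...and call it directly from the scripting language.
print(libc.labs(-42))  # -> 42
```

An FFI along these lines would let the compiler see the C function's types and generate the call inline, instead of going through a generic binding layer.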

Anyway, thanks for all of your great work on LuaJIT 2.

On 3/22/2010 12:08 PM, Mike Pall wrote:
Matt Campbell wrote:
I'd like an explanation of how LuaJIT 2.0's compiler interacts with
external C libraries that use the standard Lua C API.  Specifically, if
my Lua code calls a C function (outside the standard library) in a loop,
are there any circumstances in which that loop can't be compiled to
native code?

Such a call is not compiled at all. The trace is aborted and
LuaJIT falls back to the interpreter. Unless there's a path inside
the loop that happens not to call an unknown C function, the loop
is eventually blacklisted and runs purely in the interpreter.

Rationale: the transition to C code is too costly and the compiler
wouldn't be able to optimize across such a call, anyway.

The transition to C code requires flushing all values to the Lua
stack, setting up a new Lua stack frame and other context, and
performing the actual call to the C function, which in turn calls
back into the Lua/C API multiple times to fetch its arguments from
the stack frame, etc. This dominates the cost of calling short C
functions, and the interpreter is not that much slower in that case.

The planned FFI will fix that and will allow direct inlining of
calls to C functions.

Moral: use pure Lua code for your inner loops. Trivial C helper
functions don't pay off and may turn out to be slower than
equivalent Lua code.
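The same effect is easy to reproduce with Python's ctypes, for what it's worth: a trivial C function reached through the FFI boundary tends to lose to the equivalent pure-Python code, because per-call marshalling dominates. A rough, qualitative illustration (timings vary by machine; POSIX assumed for CDLL(None)):

```python
import ctypes
import timeit

libc = ctypes.CDLL(None)  # POSIX: load the process's C library
libc.labs.restype = ctypes.c_long
libc.labs.argtypes = [ctypes.c_long]

# A trivial "C helper" called in an inner loop through the FFI...
def via_c(n):
    total = 0
    for i in range(n):
        total += libc.labs(-i)
    return total

# ...versus the same work done entirely in the host language.
def pure_py(n):
    total = 0
    for i in range(n):
        total += abs(-i)
    return total

assert via_c(1000) == pure_py(1000)

# Per-call argument conversion and call setup usually make via_c the
# slower of the two for work this trivial -- the analogue of the
# Lua/C API transition costs described above.
t_c = timeit.timeit(lambda: via_c(1000), number=20)
t_py = timeit.timeit(lambda: pure_py(1000), number=20)
```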

My guess is that LuaJIT compiled code can't interact with
external C functions as efficiently as it does with standard library
functions that have fast paths in assembler.  But I'd like to know more
about how this works.

Whether a function has a fast path in assembler is orthogonal
to whether the compiler is able to deal with it.
It's just that the internal functions that _do_ have fast paths
are those with the biggest payoff. So they usually happen to have
an equivalent recording handler in the compiler, too.

Actually the fast paths in the interpreter are never called from
compiled code. The trace recorder recognizes a known function and
simulates the fast path by recording the appropriate IR instead.
This can be as simple as a single instruction (see below) or a
maze of conditions and transformations (e.g. string.sub).

Simple example:

   luajit -jdump -e "local x=0; for i=1,100 do x=bit.bxor(x,i) end"

The trace recorder follows the bytecode and recognizes a call to
the bit.bxor function:

0006  GGET     5   0      ; "bit"
0007  TGETS    5   5   1  ; "bxor"
0008  MOV      6   0
0009  MOV      7   4
0010  CALL     5   2   3
0000  . FUNCC               ; bit.bxor<------
0011  MOV      0   5
0012  FORL     1 =>  0006

This is turned into the equivalent IR (only the inner loop shown):

0020 ------ LOOP ------------
0021    num TONUM  0017
0022  + int BXOR   0018  0017<------
0023  + int ADD    0018  +1
0024>   int LE     0023  +100
0025    int PHI    0018  0023
0026    int PHI    0017  0022

And then compiled to machine code:

394cfff0  xor ebp, ebx<------
394cfff2  add ebx, +0x01
394cfff5  cmp ebx, +0x64
394cfff8  jle 0x394cfff0	->LOOP
394cfffa  jmp 0x394c001c	->3

Easy, huh? Ok, so there's a lot more magic going on inside. E.g.
you may have noticed it has hoisted the lookup of bit.bxor out of
the loop and narrowed everything to integer arithmetic.