lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Vyacheslav Egorov wrote:
> I am highly interested how inline is implemented. There are to cases here: 
> "upfunctions" inlining and global function inlining. Both require checks 
> and mechanism to fallback into non-optimized native code or even bytecode. 

In Lua every call is dynamically dispatched or "virtual" (in
C++/Java speak). So the same optimization technique applies: find
the most likely target and convert the dynamic dispatch into a
check and a static dispatch (direct call or an inlined copy).

You can see this at work in LuaJIT 1.1.x:

$ luajit -O -j dump -e 'x=1' -e 'local y = math.abs(x)'
--0004--  CALL        0    2    2
b7d65824  cmp dword [ebx+0x8],byte +0x06     target is a function?
b7d65828  jnz 0xb7d65859
b7d6582a  cmp dword [ebx],0x0808afc8         target is math.abs?
b7d65830  jnz 0xb7d65859
b7d65832  cmp dword [ebx+0x18],byte +0x03    arg is a number?
b7d65836  jnz 0xb7d65859
b7d65838  fld qword [ebx+0x10]               |
b7d6583b  fabs                               | inlined function
b7d6583d  mov dword [ebx+0x8],0x3            |
b7d65844  fstp qword [ebx]                   |

> During execution, after the closure creation f can be inlined (so there is 
> possibility that different closures will not share the same optimized 
> function body). Still if someone assigns to f  then this inlining becomes 
> incorrect and g needs to be deoptimized. So we need some kind of write 
> barrier here. I thought about it sometime ago, but have not managed to 
> invent an elegant implementation (only pointer comparison before inlined 
> body, but this approach is not usable with moving GC).

The simple approach taken by LuaJIT 1.x is to do the target check
every time. Target resolving (GETUPVALUE or GETGLOBAL/GETTABLE)
has to be done every time, too. But these functions are inlined
and quite fast.

Of course it would be even faster if we can remove this overhead
altogether. Type and target prediction is required to generate
the right checks in the first place. Type inference can be used
to completely eliminate the type checks in many cases.

Otherwise one can detect that these checks are invariant and move
them upwards in the dominator tree and coalesce them. This
effectively means hoisting these checks out of inlined functions
and loops. Of course this works for invariant target resolving
and target checks, too.

Care has to be taken that no use of a value is moved beyond a
conflicting store. This is trivial for upvalues (they can only
alias a single stack slot, but not any other upvalue) but tricky
for globals:
  easy: math = nil
  hard: _G.math = nil
  ugly: a[x] = math; b[y].abs = nil
  killer: setfenv(a, b)

Note that the compiler can easily cop out if it gets too
difficult and just encode the regular target resolving+check.
It's sufficient to cover the common cases.

> Also it is interesting how fallback can be implemented.

LuaJIT 1.0 added generic fallback copies of every instruction at
the end of the code for every function. A failed check branches
to the fallback code.

LuaJIT 1.1 encodes branches to special code which invokes
deoptimization (*). This in turn generates fallback code
on-demand and patches the original code to jump to it.

(*) See the three branches to 0xb7d65859 in the above example.

> I am interested both in theoretical and practical part of this, any links 
> and hints will be highly appreciated.

I'll probably settle on a trace-based compiler for LuaJIT 2.x.
This is a significant departure from the classic
function-at-a-time compilation in previous versions.

Trace-based compilation sounds complicated at first, but is much
simpler to implement. Depending on the definition of a trace,
some optimizations are trivial (e.g. a trace is only dominated by
its setup code and itself, no PHIs in the straight-line code).

Descriptions of the technique in the context of bytecode
compilation are here:

  Incremental Dynamic Code Generation with Trace Trees
  A. Gal and M. Franz, 2006

  HotpathVM: An Effective JIT for Resource-constrained Devices
  A. Gal, M. Franz, C. W. Probst, 2006

The technique originated in system emulators doing on-the-fly
binary-to-binary translation. A good reference is here:

  Dynamo: A Transparent Dynamic Optimization System
  V. Bala, E. Duesterwald, S. Banerjia, 2000

Popular software using this technique are QEMU and Valgrind.
Intel's Netburst architecture (Pentium IV) employs a trace cache
to store a copy of the x86 instruction stream translated by the
instruction decoder to RISC-like micro-ops.