- Subject: Re: Native Complex numbers for LuaJIT-2 [was Re: Benchmark shootout shows LuaJIT 2.0]
- From: Mike Pall <mikelu-0911@...>
- Date: Mon, 2 Nov 2009 22:05:42 +0100
Leo Razoumov wrote:
> Well, the next step in functions like __add(z1,z2) or __mul(z1,z2)
> will be to check that the second argument is indeed of the type
> "complex". Typically it is done with luaL_checkudata which (1) pulls
> object's metatable onto the stack (2) looks up the required metatable
> in the REGISTRY with lua_getfield(L, LUA_REGISTRYINDEX, tname) by its
> name string and (3) compares two metatables. It is a lot of overhead
> for +-/* of complex numbers. I do not see how LuaJIT can help here,
> for this overhead happens on the C-side of things outside of LuaJIT
> control. Is a good solution possible?
You are still thinking in too interpreter-centric a way. *None* of
that involves calls to the C side once the code is JIT-compiled.
The trace compiler records the _functionality_ of each bytecode,
not the actual C code involved. In effect it does a complete
simulation of every bytecode and all the associated metamethod
stuff, before it's even run.
So here's what happens when the trace recorder sees the BC_ADD:
- First it checks the runtime type of the two operands: since
  these are userdata it resolves their metamethods, recording
  every lookup on the way.
- Then it records the call to the resolved metamethod: it sees
  that it's one of the special internal fast functions and runs
  the associated recording handler.
- The recording handler checks that the arguments it got are
  indeed of the right userdata type (recording the functionality
  of the checks involved).
- The two complex numbers are pulled from the userdata (recording
  four number loads on the way -- the unboxing step), added together
  (recording two additions) and stored in a newly created userdata
  (recording an allocation and two stores -- the boxing step).
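
To make the recorded steps concrete, here is a rough sketch in plain C of
what the unoptimized sequence for one add amounts to. This is only an
illustration: the Box struct, the complex_meta tag and both function names
are hypothetical stand-ins, not LuaJIT's real userdata layout or IR. A
failed type check returns NULL here, where a real trace would take a side
exit.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a userdata object (not LuaJIT's layout). */
typedef struct Box {
    const void *meta;   /* identifies the userdata type (its "metatable") */
    double re, im;      /* payload: the complex number */
} Box;

/* The address of this object serves as the unique type tag. */
static const char complex_meta = 0;

/* Boxing step: one allocation and two stores. */
Box *box_complex(double re, double im) {
    Box *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->meta = &complex_meta;
    b->re = re;
    b->im = im;
    return b;
}

/* One boxed add: type checks, four loads, two adds, then boxing. */
Box *complex_add_boxed(const Box *a, const Box *b) {
    if (a == NULL || b == NULL) return NULL;
    /* Type checks (in this sketch a mismatch just returns NULL). */
    if (a->meta != &complex_meta || b->meta != &complex_meta) return NULL;
    /* Unboxing step: four number loads. */
    double ar = a->re, ai = a->im;
    double br = b->re, bi = b->im;
    /* Two additions. */
    double rr = ar + br;
    double ri = ai + bi;
    /* Boxing step: allocation plus two stores. */
    return box_complex(rr, ri);
}
```

Calling complex_add_boxed on the boxed values 1+2i and 3+4i yields a new
box holding 4+6i, which the caller must free.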
Only after the recording of the bytecode is finished does the actual
bytecode execution start in the interpreter. It should follow
exactly the simulated steps and come up with a new userdata
in the end.
Now, if a subsequent BC_MUL of the result is seen, almost the same
happens, except the recorded loads from the userdata are forwarded
from the previous stores (eliminating the unboxing). I.e. the
multiplies directly operate on the adds, bypassing the userdata
object. This makes the stores redundant, which in turn makes the
allocation redundant. Depending on the surrounding code, the
stores and the allocation can be sunk into a side exit or
eliminated altogether (eliminating the boxing).
After hoisting, the end result is a pure expression (z1 + z2) * z3
on complex numbers, which is split up into scalar operations.
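
As a rough illustration of that end result, the scalar form of
(z1 + z2) * z3 could look like the C sketch below. The Complex struct and
the function name are hypothetical; the point is only that the intermediate
sum lives in two plain doubles and is never boxed, so no allocation or
store is needed for it.

```c
/* Hypothetical value type used only to return the two result parts. */
typedef struct Complex { double re, im; } Complex;

/* (z1 + z2) * z3 split into scalar operations on real/imaginary parts. */
Complex add_mul_scalar(double z1r, double z1i,
                       double z2r, double z2i,
                       double z3r, double z3i) {
    /* The add: two scalar additions, results kept in locals. */
    double sr = z1r + z2r;
    double si = z1i + z2i;
    /* The multiply operates directly on the results of the adds:
       (sr + si*i) * (z3r + z3i*i) */
    Complex out;
    out.re = sr * z3r - si * z3i;
    out.im = sr * z3i + si * z3r;
    return out;
}
```

For example, with z1 = 1+2i, z2 = 3+4i and z3 = 2-1i the sum is 4+6i and
the product is 14+8i.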