Adam Strzelecki wrote:
> Just one question about allocation sinking and vectors - You
> said it is easier for SIMD vectors as they are immutable value
> types. So if we are here to avoid allocation at all when the
> values do not leave the trace, then where the intermediate
> results from vector operators (i.e. A = A + B * C) would be
> kept? Registers, stack, separate static memory pool?

Registers or stack. In case you run out of registers, this is
basically converting the heap allocation to a stack allocation.
Except that it's a side-effect of the register allocator doing its
thing. It doesn't need to match the original structure layout and
it may not spill all of it to the stack.

Strictly speaking, this is more efficient than classic C stack
allocation. Though modern C compilers are quite good at avoiding
the stack allocation in turn and keeping everything in registers.
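For comparison, here is a minimal C sketch of the A = A + B * C case with a by-value vector type. The `vec4` struct and the `vec4_*` helpers are hypothetical names made up for this illustration and are not part of any LuaJIT API. The intermediate product never escapes the function, so an optimizing C compiler is free to keep it in registers rather than giving it a stack slot.

```c
#include <stddef.h>

/* Hypothetical 4-float value type (an assumption for this sketch).
 * It is passed and returned by value and never heap-allocated. */
typedef struct { float v[4]; } vec4;

/* Element-wise product of two vectors. */
static vec4 vec4_mul(vec4 a, vec4 b)
{
    vec4 r;
    for (size_t i = 0; i < 4; i++)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Element-wise sum of two vectors. */
static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (size_t i = 0; i < 4; i++)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* A = A + B * C. The temporary t is an immutable intermediate that
 * does not leave this function, so no allocation is required for it. */
static vec4 vec4_muladd(vec4 a, vec4 b, vec4 c)
{
    vec4 t = vec4_mul(b, c);
    return vec4_add(a, t);
}
```

Whether `t` actually ends up in a register or in a spill slot is up to the compiler's register allocator, which is the same situation as described above for the trace compiler.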

> The only full reference of allocation sinking I found
> "Allocation Removal by Partial Evaluation in a Tracing JIT"
> deals rather with constants, and moving them and alternatively
> postponing allocation to trace exits.

> If there are to be user-definable operator intrinsics then there
> would need to be some ABI for passing operands and returning
> results, right? Like passed via XMM or via stack?

Those intrinsics need to match the operand types the CPU supports.
That's usually a specific register class. On x86/x64 many
instructions also allow a memory operand as source. But making use
of that is really an optimization (fusing a load into an operand,
i.e. -Ofuse). The instructions for most other CPUs operate on
registers only. These intrinsics look like regular functions at
higher levels, but the implementation details are very different.

> Then long vectors won't fit into all registers anyway. If we
> want to multiply two 4x4 matrices and fit operands and results
> into 128-bit registers we would need 12 of them, when there are
> only 8 in 32-bit SSE. Sounds pretty complicated.

Well, you need to split this up into lots of MULPS, ADDPS and
shuffles, anyway. Each one of these only takes two operands. The
register allocator will make sure to generate as few spills or
restores to/from the stack as possible.
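
To make the decomposition concrete, here is a portable C sketch of a 4x4 multiply written as two-operand steps. The `vec4`/`mat4` types and helper names are hypothetical and only model the operations: on SSE, `vec4_splat` would roughly correspond to a shuffle that broadcasts one element, `vec4_mul` to MULPS and `vec4_add` to ADDPS. It is one possible formulation under those assumptions, not what any particular compiler emits.

```c
/* Hypothetical value types for this sketch: one vec4 models the
 * contents of a 128-bit register, a mat4 is four row vectors. */
typedef struct { float v[4]; } vec4;
typedef struct { vec4 row[4]; } mat4;

/* Broadcast one scalar to all four lanes (models a shuffle). */
static vec4 vec4_splat(float x)
{
    vec4 r = {{x, x, x, x}};
    return r;
}

/* Two-operand element-wise multiply (models MULPS). */
static vec4 vec4_mul(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Two-operand element-wise add (models ADDPS). */
static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* R = A * B, row-major. Row i of R is the sum over k of
 * splat(A[i][k]) * row k of B. */
static mat4 mat4_mul(const mat4 *a, const mat4 *b)
{
    mat4 r;
    for (int i = 0; i < 4; i++) {
        vec4 acc = vec4_mul(vec4_splat(a->row[i].v[0]), b->row[0]);
        for (int k = 1; k < 4; k++) {
            vec4 t = vec4_mul(vec4_splat(a->row[i].v[k]), b->row[k]);
            acc = vec4_add(acc, t);
        }
        r.row[i] = acc;
    }
    return r;
}
```

In this formulation that is 16 broadcasts, 16 multiplies and 12 adds in total. Only the four rows of B, one accumulator and one temporary need to be live at any point, so the whole operands and result never have to sit in registers simultaneously.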