Re: Why was a stack-based C API chosen?

The main difference between a pure "stack-based" implementation and a "register-based" implementation (when in fact both use a stack) is the fact that the stack is managed by two pointers: the top of stack (SP) and a frame base pointer (BP); "registers" are easier to design at static offsets from the base pointer, they can be loaded and cached in actual native registers, and saved back to the stack frame when needed, while the stack pointer (BP) allows free growth of the stack, without needing to recompute relative offsets for "registers":

The implementation is simple, and this model is used in almost all C/C++ compilers (because C/C++ functions have variable numbers of parameters with varargs).

It is not needed for languages that have fixed parameters (like Pascal) where a frame pointer register is not necessary to compile such function, so all "registers" can be relative to the top of stack (no need to link a new stack frame; such optimization is also used in C/C++ functions that are declared with static number and types of parameters, i.e. no varargs, and when "inlining" function calls where the type and number of parameter is known (in which case "registers" can be relative to the top of stack and there's more freedom to allocate the "native" registers, and possibly suppress their allocation in a slot of the stack).

Beside this, you always need operations that will reference dynamic relative offsets from the top of stack (e.g. "index" in Forth or PostScript), and some VM provide a way to "rotate" several positions from the top of stack to use one of the position that will no longer be needed after: this allows a reduction of total stack size needed. The implementation can then allocate some hardware registers dynamically, or keep a few of them static (notably SP or A7 on 68k, BP or A6 on 68k, and the element at top of stack is frequently remaining in a native register like AX or D0 or A0 as it changes often: not everything needs to be on the stack and the same register can be used for the first returned value.

There remains then other registers (BX, CX, DX, or D1... D7 and A1.. A5 on 68k) for dynamic use inside the function; as well a "flags" register is for general use and can void pushing/popping booleans on the stack. an x86 does not have a lot of generic registers, except with extensions like vectors, x64 offers much more native registers which may be used in some optimizing compilers; the number of registers that an be used instead of pushing/popping on the stack or writing to the stack frame is platform dependant (and a compiler may generate code that autodetect the type of CPU and will use more registers when it can improve the performance by reducing stack usage, but the problem with CPUs with many registers is with hardware interrupts, exceptions, and thread or process switches: you need more space in stack to save the state and restore it, so these context switches are costly when they occur at arbitrary time; this is not a problem with cooperative threadings, i.e. coroutines, that can be switched much more easily and faster if the compîler has deliberately chosen to not use a large set of registers, or when cooperative context swithing canno occur within code fragments that use more registers than those that a coroutine switch will be saving and restore from the stack)

Lua is a bit special as its VM uses in fact two separate stacks: a generic large stack for environments and managing stack frames and calls to the native libraries, and a small stack for local use only inside Lua functions, and which already acts as a set of registers, the first ones being used relative with static offset for declared local variables and parameters or return values, the rest for temporary usage to evaluate expressions, and to prepare a stack frame for calling a new function that will be placed in the environment stack at known offsets too, so Lua can also uses two pointers: a top of stack (in the small stack) and a base frame pointer (in the large stack).

Then all optimisations for allocating native registers is only for the content of the small stack used inside functions. If the native CPU has a large enough "window" of registers, the small stack used inside function can be avoided completely and you don't even need a register for the top of stack in this small stack as it should always be possible (inside the same function) to determine which register number to use for every operation and every local variables. The VM bytecode however is agnostic about hardware registers, and still uses a "virtual" stack, whose implementation at runtime is effectively a detail.

Le lun. 25 mai 2020 à 15:20, Pierre Chapuis <catwell@archlinux.us> a écrit :

On Fri, May 22, 2020, at 21:06, Joseph C. Sible wrote:

> A lot of emphasis is placed on the fact that Lua bytecode runs on a
> register-based VM, rather than on a stack-based VM, and that this is
> significant for performance reasons. Why is the Lua C API then
> stack-based instead of also being register-based for the same reason?

The stack-based C API predates the register-based VM, which was only
introduced in Lua 5. The way I see it, the C API is an interface designed
for the user, and the use of registers internally (which are themselves
organized in a stack) is an implementation detail.

A "register-based API" would not be much different anyway, the main
alternative to the stack-based API used by Lua would be an "object"
API where pointers to internal objects are returned to the caller. This
kind of API is used in several other languages but it has a huge drawback
compared to what Lua does, which is that callers have to interact with the
GC. It makes it a lot more error-prone in practice.

--
Pierre Chapuis