The distinction between "stack based" and "register based" is artificial. "Registers" are just a specification of a hierarchy between a limited set of storage units with "fast" or "compact" access and a larger set of storage units which requires a less compact bytecode. But this has absolutely no consequence on how the runtime will allocate the "physical" register space in an actual CPU. And CPUs use another internal representation anyway: hardware instruction decoders already transform the "native" code into another representation (so the old debate between RISC and CISC is over: CISC instruction sets are automatically converted into RISC inside CPU instruction decoders). Internally there are in fact MORE hardware registers, the "native registers" can become a stack, with the exposed "registers" becoming just a narrow window within a larger stack, and instructions can be rescheduled across multiple pipelines.
To design a VM you actually don't need such artificial segregation, which requires much more work on the compiler and complicates the task of implementing the final compilation steps by adding many more constraints.
You could very well work with a pure stack-based instruction set (working like PostScript, Forth or the JVM, with additional "operators" to manipulate the stack such as "dup", "pop" and "index", where "index" acts as if we were accessing numbered registers relative to a stack frame pointer, acting like a "window" of registers).
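A minimal sketch of such a pure stack machine, assuming invented opcode names (PUSH, DUP, POP, INDEX, ADD are illustrative, not taken from any real ISA):

```python
# Toy stack-based VM: all operands live on one stack, and "INDEX" reads a
# slot by number relative to the frame base, like a register window.

def run(program):
    stack = []
    for op, *args in program:
        if op == "PUSH":                      # push an immediate value
            stack.append(args[0])
        elif op == "DUP":                     # duplicate the top of stack
            stack.append(stack[-1])
        elif op == "POP":                     # discard the top of stack
            stack.pop()
        elif op == "ADD":                     # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "INDEX":                   # copy slot n counted from the
            stack.append(stack[args[0]])      # frame base: "register" access
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack

# Computes 3 + 4 + 3, reusing slot 0 like a numbered register:
result = run([("PUSH", 3), ("PUSH", 4), ("INDEX", 0), ("ADD",), ("ADD",)])
```

Here "INDEX 0" re-reads the first pushed value without popping anything, which is exactly the register-like access the text describes.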
The choice is then how to create the instruction set so that it is compact and *may* improve the data locality.
There also exist CPUs that use a large internal stack of registers, which is in fact a local cache for an even larger external stack in memory, with mechanisms for syncing the two stacks similar to "paging": automatic commits and loads, so that the programmer never needs any explicit "load" or "store" for operations on the stack, only "moves" between "registers" (numbered relative to a stack frame).
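That spill/fill mechanism can be sketched as a toy model (not any specific CPU; the window size and the lazy refill-on-empty policy are simplifying assumptions made here for illustration):

```python
# Toy model of a register "window" backed by a larger memory stack.
# Commits and loads happen automatically, as described above: the user
# of this class only ever pushes and pops, never spills explicitly.
WINDOW_SIZE = 4  # invented size, real CPUs differ

class WindowedStack:
    def __init__(self):
        self.memory = []   # the larger external stack in memory
        self.window = []   # the fast register window caching the top slots

    def push(self, value):
        if len(self.window) == WINDOW_SIZE:
            # window full: automatically commit the oldest slot to memory
            self.memory.append(self.window.pop(0))
        self.window.append(value)

    def pop(self):
        value = self.window.pop()
        if not self.window and self.memory:
            # window drained: automatically load a slot back from memory
            self.window.append(self.memory.pop())
        return value
```

Pushing six values spills the two oldest to memory behind the scenes; popping six values transparently reloads them, so the caller sees one seamless stack.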
So any "theory" distinguishing "stack-based" from "register-based" instructions is just fuzzy and in fact not justified at all. Both approaches can be implemented in a fast or slow way, and both could be implemented with poor or good data locality. It's up to the compiler to correctly set up the target instruction set while converting the source instruction set.
If we want to create a VM which is simple to implement and port, and then easy to optimize for the target instruction set, the stack-based approach is simpler, and gives the compiler much more freedom to generate the best code in the target instruction set. Optimizers need to analyse the datapaths and dependencies, and this is equally complicated for a stack-based or a register-based source ISA if you want full optimizations. The register-based approach, however, allows simpler implementation of minor local optimizations without deep inspection of datapaths and dependencies, and that's why hardware instruction decoders use the register-based approach: they cannot optimize everything, but they make very local minor optimizations (adding more registers extends a bit the scope of these local optimizations).
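The datapath analysis involved in converting between the two forms is mostly bookkeeping: a compiler lowering stack bytecode to register code can track which virtual register each stack slot refers to. A minimal sketch under that assumption (opcode names and the three-address output format are invented for illustration):

```python
# Sketch: lowering stack-based bytecode to register-style three-address code.
# The "stack" here holds virtual register NAMES, not values: tracking where
# each value lives is the datapath analysis described above.

def lower(program):
    stack = []                     # virtual register names, not values
    out = []                       # emitted register-based instructions
    fresh = iter(range(10**6))     # source of new virtual register numbers
    for op, *args in program:
        if op == "PUSH":
            r = f"r{next(fresh)}"
            out.append(f"{r} = {args[0]}")
            stack.append(r)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            r = f"r{next(fresh)}"
            out.append(f"{r} = {a} + {b}")
            stack.append(r)
        elif op == "DUP":
            stack.append(stack[-1])   # no instruction emitted: pure renaming
    return out

# (PUSH 2, PUSH 3, DUP, ADD, ADD) lowers to four register instructions;
# the DUP vanishes entirely, since it only duplicates a register name.
code = lower([("PUSH", 2), ("PUSH", 3), ("DUP",), ("ADD",), ("ADD",)])
```

Note how the stack-manipulation operator disappears in the register form: this is one concrete sense in which the choice of source ISA is a front-end detail rather than a performance decision.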
Consider the x86 instruction set: most programs use a handful of registers (AX, BX, CX, DX, SP, BP, plus the implicit IP which is explicitly used only by absolute or relative jumps/branches); other registers are underused (including size-extended registers). The same is true for the 68k (many programs do not use more than 4 data registers and 3 address registers).
Vector instructions may use more registers, but only with specific instructions in specific situations where parallelization can be applied instead of scheduling multiple instructions; yet even CPUs without vector instructions in their CISC instruction set implement a form of vectorisation to schedule instructions on multiple pipelines (they use various internal caches so that instructions need not be decoded multiple times, including in tight loops).
The theory comparing the two approaches just allows studying their pros and cons for implementing a compiler, but both are valid, and generally a mix of the two will be used by the compiler to correctly schedule instructions in the target ISA (once the limits of the two ISAs are correctly specified and fully understood).