The distinction between "stack based" and "register based" is artificial. "Registers" are just a specification of a hierarchy between a limited set of storage units with "fast" or "compact" access and a larger set of storage units which requires a less compact bytecode. But this has absolutely no consequence on how the runtime will allocate the "physical" register space in an actual CPU. And CPUs use another internal representation anyway: hardware instruction decoders already transform the "native" code into another representation (so the old debate between RISC and CISC is over: CISC instruction sets are automatically converted into RISC inside CPU instruction decoders). Internally there are in fact MORE hardware registers, the "native registers" can become a stack, with the exposed "registers" becoming just a narrow window within a larger stack, and instructions can be rescheduled across multiple pipelines.
To design a VM you actually don't need such artificial segregation, which requires much more work on the compiler and complicates the task of implementing the final compilation steps by adding many more constraints.
You could very well work with a pure stack-based instruction set (working like PostScript, Forth or the JVM, with additional "operators" to manipulate the stack such as "dup", "pop" and "index", where "index" acts as if we were accessing numbered registers relative to a stack frame pointer, acting like a "window" of registers).
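A minimal sketch of such a pure stack machine, assuming invented opcode names (PUSH, DUP, POP, INDEX, ADD are illustrative, not taken from any real ISA):

```python
# Toy stack-based VM: all operands live on one stack, and "INDEX" reads a
# slot by number relative to the frame base, like a register window.

def run(program):
    stack = []
    for op, *args in program:
        if op == "PUSH":                      # push an immediate value
            stack.append(args[0])
        elif op == "DUP":                     # duplicate the top of stack
            stack.append(stack[-1])
        elif op == "POP":                     # discard the top of stack
            stack.pop()
        elif op == "ADD":                     # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "INDEX":                   # copy slot n counted from the
            stack.append(stack[args[0]])      # frame base: "register" access
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack

# Computes 3 + 4 + 3, reusing slot 0 like a numbered register:
result = run([("PUSH", 3), ("PUSH", 4), ("INDEX", 0), ("ADD",), ("ADD",)])
```

Here "INDEX 0" re-reads the first pushed value without popping anything, which is exactly the register-like access the text describes.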
The choice is then how to create the instruction set so that it is compact and *may* improve the data locality.
There also exist CPUs that use a large internal stack of registers, which is in fact a local cache for an even larger external stack in memory, with mechanisms for syncing the two stacks similar to "paging": automatic commits and loads, so that the programmer never needs any explicit "load" or "store" for operations on the stack, only "moves" between "registers" (numbered relative to a stack frame).
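That spill/fill mechanism can be sketched as a toy model (not any specific CPU; the window size and the lazy refill-on-empty policy are simplifying assumptions made here for illustration):

```python
# Toy model of a register "window" backed by a larger memory stack.
# Commits and loads happen automatically, as described above: the user
# of this class only ever pushes and pops, never spills explicitly.
WINDOW_SIZE = 4  # invented size, real CPUs differ

class WindowedStack:
    def __init__(self):
        self.memory = []   # the larger external stack in memory
        self.window = []   # the fast register window caching the top slots

    def push(self, value):
        if len(self.window) == WINDOW_SIZE:
            # window full: automatically commit the oldest slot to memory
            self.memory.append(self.window.pop(0))
        self.window.append(value)

    def pop(self):
        value = self.window.pop()
        if not self.window and self.memory:
            # window drained: automatically load a slot back from memory
            self.window.append(self.memory.pop())
        return value
```

Pushing six values spills the two oldest to memory behind the scenes; popping six values transparently reloads them, so the caller sees one seamless stack.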
So any "theory" distinguishing "stack-based" from "register-based" instructions is just fuzzy and in fact not justified at all. Both approaches can be implemented in a fast or slow way, and both could be implemented with poor or good data locality. It's up to the compiler to correctly set up the target instruction set while converting the source instruction set.
If we want to create a VM which is simple to implement and port, and then easy to optimize for the target instruction set, the stack-based approach is simpler, and gives the compiler much more freedom to generate the best code in the target instruction set. Optimizers need to analyse the datapaths and dependencies, and this is equally complicated for a stack-based or a register-based source ISA if you want full optimizations. The register-based approach, however, allows simpler implementation of minor local optimizations without deep inspection of datapaths and dependencies, and that's why hardware instruction decoders use the register-based approach: they cannot optimize everything, but they make very local minor optimizations (adding more registers extends a bit the scope of these local optimizations).
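The datapath analysis involved in converting between the two forms is mostly bookkeeping: a compiler lowering stack bytecode to register code can track which virtual register each stack slot refers to. A minimal sketch under that assumption (opcode names and the three-address output format are invented for illustration):

```python
# Sketch: lowering stack-based bytecode to register-style three-address code.
# The "stack" here holds virtual register NAMES, not values: tracking where
# each value lives is the datapath analysis described above.

def lower(program):
    stack = []                     # virtual register names, not values
    out = []                       # emitted register-based instructions
    fresh = iter(range(10**6))     # source of new virtual register numbers
    for op, *args in program:
        if op == "PUSH":
            r = f"r{next(fresh)}"
            out.append(f"{r} = {args[0]}")
            stack.append(r)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            r = f"r{next(fresh)}"
            out.append(f"{r} = {a} + {b}")
            stack.append(r)
        elif op == "DUP":
            stack.append(stack[-1])   # no instruction emitted: pure renaming
    return out

# (PUSH 2, PUSH 3, DUP, ADD, ADD) lowers to four register instructions;
# the DUP vanishes entirely, since it only duplicates a register name.
code = lower([("PUSH", 2), ("PUSH", 3), ("DUP",), ("ADD",), ("ADD",)])
```

Note how the stack-manipulation operator disappears in the register form: this is one concrete sense in which the choice of source ISA is a front-end detail rather than a performance decision.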
Consider the x86 instruction set: most programs use a handful of registers (AX, BX, CX, DX, SP, BP, plus the implicit IP which is explicitly used only by absolute or relative jumps/branches); other registers are underused (including size-extended registers). The same is true for the 68k (many programs do not use more than 4 data registers and 3 address registers).
Vector instructions may use more registers, but only with specific instructions in specific situations where parallelization can be applied instead of scheduling multiple instructions; yet even CPUs without vector instructions in their CISC instruction set implement a form of vectorisation to schedule instructions on multiple pipelines (they use various internal caches so that instructions need not be decoded multiple times, including in tight loops).
The theory comparing the two approaches just allows studying their pros and cons for implementing a compiler, but both are valid, and generally a mix of the two will be used by the compiler to correctly schedule instructions in the target ISA (once the limits of the two ISAs are correctly specified and fully understood).