Ask yourself why these closure objects exist: functions may be recursive (and not only through tail calls), so a variable captured in a closure may live not in the immediate parent's frame but in some ancestor frame an arbitrary number of levels up the call stack. To avoid routing every access to a captured variable through a chain of frames, a mapping is created and initialized before each call, rebinding each variable for the next inner call according to the closure's prototype. Such an object is allocated only when the function uses external variables that are not local to the function itself, i.e. that do not sit directly within its stack window.
This also explains why the bytecode needs separate opcodes for accessing registers and upvalues: if upvalues could live directly on the stack, it would be enough to reference them with negative register indexes and treat the stack as a sliding window, as C/C++ calling conventions do (except that in C/C++ the stack indexes are negative for parameters and positive for local variables, some parameters being cached in actual CPU registers but still backed by "shadow" slots on the stack, either allocated at fixed positions in the stack frame by the compiler or pushed before the actual parameters and popped after the call to restore those registers).
There have been performance tests showing that closures are not that fast: they can create massive amounts of garbage-collected objects (with internal type LUA_TCLOSURE). I find this behavior very curious, and the current implementation, which allocates the LUA_TCLOSURE objects on the heap, is not the best option: the mapping could be allocated directly among the caller's local variables/registers on the stack, and all the closure objects used by a caller could be merged into a single one, sized for the largest closure object needed by its inner calls, merged like in a union. The closure objects themselves do not hold any variable values; they are just simple mappings from a small fixed set of integers (between 1 and the number of upvalues of the called function) to variable integers (absolute indexes in the thread's stack where the actual variable is located).
The bytecode is not as optimized as it could be: register numbers are only positive and upvalue numbers are also only positive, but they could form a single set (positive integers for local registers, negative integers for upvalues, the latter indexing entries in the closure object to reach the actual variable located anywhere on the stack, outside the immediate parent frame). The generated bytecode is also suboptimal because various operations can only work on registers or constants (like ADD r1,r2,r3), so temporary registers must be allocated by the compiler (remember that the number of registers is limited). As well, Lua's default engine treats all registers the same, when most code would work with a single r0 register (an "accumulator") that could be implicit for most instructions; this would reduce instruction sizes (currently 32 or 64 bits), whose current width is inefficient because it puts too much pressure on the CPU's L1 data cache.
I'm convinced that the current approach in the existing Lua VM engine and its internal instruction set can be largely improved for better performance, without really changing the language itself: better data locality (smaller instruction sizes, fewer wasted unused bit fields) and elimination of heap allocations for closures (to dramatically reduce the stress on the garbage collector).