
Hi, wrote:
> I've wondered if a VM with byte-aligned sub-fields would be a 
> performance gain?

Probably not.

The opcode and the operands are moved to a register with a single
32-bit fetch, and then the subfields are moved to other registers.
Bitfield access ( (a >> M) & ((1 << N)-1) ) is compiled either
to SHIFT + AND or to a native bitfield instruction. The AND for the
most significant field and the SHIFT for the least significant field
are optimized away, so with four fields you get
1xFetch + 3xSHIFT + 3xAND.
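
To make the operation count concrete, here is a minimal sketch of the
packed decode in C. The field layout (6-bit opcode in the low bits, then
8-bit A, 9-bit C, 9-bit B) and all names (Decoded, field, decode_packed,
encode_packed) are assumptions made for illustration, not taken from the
Lua sources.

```c
#include <stdint.h>

/* Hypothetical layout, low bits to high bits: op(6) a(8) c(9) b(9). */
enum { OP_BITS = 6, A_BITS = 8, C_BITS = 9, B_BITS = 9 };
enum { A_POS = OP_BITS, C_POS = A_POS + A_BITS, B_POS = C_POS + C_BITS };

typedef struct { uint32_t op, a, b, c; } Decoded;

/* Generic bitfield access: (insn >> pos) & ((1 << bits) - 1). */
static uint32_t field(uint32_t insn, unsigned pos, unsigned bits) {
    return (insn >> pos) & ((1u << bits) - 1u);
}

static Decoded decode_packed(const uint32_t *pc) {
    uint32_t insn = *pc;                   /* one 32-bit fetch */
    Decoded d;
    d.op = insn & ((1u << OP_BITS) - 1u);  /* least significant field: no SHIFT */
    d.a  = field(insn, A_POS, A_BITS);     /* SHIFT + AND */
    d.c  = field(insn, C_POS, C_BITS);     /* SHIFT + AND */
    d.b  = insn >> B_POS;                  /* most significant field: no AND */
    return d;
}

/* Helper for building test instructions; assumes each value fits its field. */
static uint32_t encode_packed(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
    return op | (a << A_POS) | (c << C_POS) | (b << B_POS);
}
```

Written out this way, decode_packed performs one load, three shifts
and three ANDs, and the shift/AND pairs for different fields do not
depend on each other.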

The only advantage of byte-aligned subfields would be that they
could be fetched individually with 4 x zero-extend fetches (8->32 bit).
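
For comparison, a sketch of the byte-aligned alternative in C. The
encoding (four consecutive bytes op, a, b, c) and the names
DecodedBytes and decode_bytes are hypothetical; whether a compiler
keeps these as four separate loads or merges them depends on the
compiler and target.

```c
#include <stdint.h>

typedef struct { uint32_t op, a, b, c; } DecodedBytes;

/* Hypothetical byte-aligned encoding: four consecutive bytes op, a, b, c. */
static DecodedBytes decode_bytes(const uint8_t *pc) {
    DecodedBytes d;
    d.op = pc[0];  /* each assignment is an 8->32 bit zero-extending load */
    d.a  = pc[1];
    d.b  = pc[2];
    d.c  = pc[3];
    return d;
}
```

No shifts or ANDs appear in the source, but every field now costs its
own memory access.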

Although this is a bit unintuitive, the first approach (one 32-bit
fetch plus shifts and ANDs) is likely to be faster on modern CPUs,
because it allows for more instruction-level parallelism.

First, a zero-extend fetch may be split internally into (at least)
two micro-ops (some CPUs have a fast path for this, though):
one for the fetch unit and at least one for the execution unit.
And while there are usually many execution units working in parallel,
there are not that many fetch units. It's likely that you get
pipeline stalls with the latter approach. And you need instructions
using these operands, too. These are most probably fetches as well
(access to the Lua variable stack or the constants array).

Oh, and you tie up a register for the Lua program counter for a
longer duration. This is bad for register-starved machines (x86).
Note: it depends on how you compile Lua (-fomit-frame-pointer et al),
but usually the base (pointer to the variables on the Lua stack) is the
only C variable kept in a register in luaV_execute().

But byte-aligned subfields have practical disadvantages, too:
there is no need for 256 different opcodes, but one may want
larger operands.

> Ditto for a 64 bit VM... ;-)

Extending the sub-fields would not buy you much (how many local
variables do you need?). Fetching two instructions at once is probably
more trouble than it's worth. Current 64 bit CPUs are still heavily
optimized for 32 bit operations.

There are only a few algorithms that benefit from 64 bit operations
(e.g. cryptography). In fact the increased memory bandwidth
requirement for storing pointers is usually a drawback (but some
apps really need the larger address space).

The main gain from the x86 -> x64 move is due to the fact that
it doubles the number of registers and not just the size of them.
Compare this with PPC32 vs. PPC64 performance numbers where the
differences are a lot less pronounced.