Hi,

Roberto Ierusalimschy wrote:
> > Conversely, if it is a significant optimization, that would suggest that it
> > might be better to move the K bits from operand to opcode, making the
> > Instruction 8/8/8/8 instead of 6/8/9/9. If my count is right, this would
> > actually increase the number of opcodes by 35, but the implementations
> > of the additional opcodes are small and it would probably only increase
> > the size of the VM by a little.
> > 
> > In fact, if the K versions of the binary opcodes were only used
> > for numeric constants, then the KB-KC variants would be
> > unnecessary and the extra opcodes would just fit into the
> > existing six bits, and furthermore a number of ttisnumber()
> > tests could also be avoided.
> 
> We were thinking exactly that for Lua 5.2 :) (A small correction: I did
> not say, and I do not think, that this is "a significant optimization",
> only an optimization.)

I'd caution against making such a change without benchmarking it:

  -- generate a chunk with a loop of 100 additions; each addition
  -- randomly uses either a register operand ("x=x+x") or a
  -- constant operand ("x=x+0")
  math.randomseed(os.time())
  local s="local x=0 for i=1,5e6 do "
  for i=1,100 do
    if math.random(2) ~= 1 then s=s.."x=x+x " else s=s.."x=x+0 " end
  end
  s=s.."end"
  loadstring(s)()  -- compile and run the generated chunk

This represents the worst case: the loop body consists almost
entirely of OP_ADDs, which makes the switch(GET_OPCODE(i))
(almost) perfectly predictable, while the random mix of register
and constant operands makes the RKC() branch unpredictable.
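
For illustration, the chunk built above looks roughly like this
(the order of the two statements is random, but fixed once the
string has been generated):

  local x=0
  for i=1,5e6 do
    x=x+0 x=x+x x=x+x x=x+0 x=x+x -- ...100 additions in total
  end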

Now time this Lua program and compare it to the same program
with the ~= 1 replaced by ~= 0. Since math.random(2) never
returns 0, only "x=x+x" is generated, which makes the RKC()
branch completely predictable.
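
I.e. the only change is the comparison in the generator line:

    if math.random(2) ~= 0 then s=s.."x=x+x " else s=s.."x=x+0 " end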

I've benchmarked this on various machines from PIII (short
pipeline, low branch misprediction overhead) to P4 (long
pipeline) to AMDs (somewhere in the middle). The variant with the
unpredictable branch is slower by about 20% to 45%, and the
slowdown scales nicely with the CPU pipeline length. (*)

BUT ... of course this is an extreme case. You've got to scale
it down for the usual cases, where the operations are neither
completely random nor do all of them consist of adds. E.g. a
naive test with a simple alternation between the two additions
did not fool the branch predictor and did not cause a slowdown.
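
Such an alternating variant can be generated along these lines
(a sketch, not necessarily the exact test):

  for i=1,100 do
    if i % 2 == 1 then s=s.."x=x+x " else s=s.."x=x+0 " end
  end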

So I've taken a few benchmarks and tried to replace all
operations on constants with references to local variables which
are initialized to the same constant. Well, I could not find a
significant difference. It was probably lost in all the CPU
pipeline noise (e.g. from the unpredictable opcode switch).
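
As an example of that rewrite (a sketch; the variable name is
made up):

  -- constant operand: the 0 is fetched from the constant table
  -- via the RKC() constant path
  for i=1,5e6 do x=x+0 end

  -- local operand: zero lives in a register, so RKC() always
  -- takes the register path
  local zero=0
  for i=1,5e6 do x=x+zero end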

So, in the end I'm not sure whether this pays off except for a
targeted benchmark. I guess one would need to implement this
optimization first and then compare the effects on a wide variety
of programs. However, I doubt it's a _significant_ optimization.


(*) Note that Intel is phasing out the P4s. The Core 2 Duos have
much shorter pipelines and better branch predictors, i.e. the
maximum slowdown is probably no longer relevant in the long term.
RISCs have shorter pipelines, too.

Bye,
     Mike