Mark Hamburg wrote:
> Here were the opportunities I saw for optimization in my broadcast
> routine:

Here's a test case with a slight rewrite to get rid of the vararg
and the pcall (neither of which affects the argument below):

  function broadcast(array, msg)
    for i=1,#array do
      local obj = array[i]
      obj[msg](obj)
    end
  end

  local x = 0
  local obj = { foo = function(o) x = x + 1 end }
  local t = {}
  for i=1,100 do t[i] = obj end
  broadcast(t, "foo")

Have a look at TRACE 2 when run with: luajit -jdump=i test.lua

---- TRACE 2 IR
0001    int SLOAD  #4    RI           loop bound (#array)
0002 >  int LE     0001  +2147483646  guard for narrowed loop
0003    int SLOAD  #3    I            i
0004 >  tab SLOAD  #1                 array
0005    int FLOAD  0004  tab.asize    \
0006 >  int ABC    0005  0003         /  array[i] bounds check
0007    ptr FLOAD  0004  tab.array    \
0008    ptr AREF   0007  0003          ) obj = array[i]
0009 >  tab ALOAD  0008               /
0010 >  str SLOAD  #2                 msg
0011    ptr HREF   0009  0010         \
0012 >  fun HLOAD  0011               /  method = obj[msg]
0013 >  fun FRAME  0012  test.lua:9   specialize to method foo
0014 >  ptr UREFO  test.lua:9  #0     \
0015 >  num ULOAD  0014                \ inlined
0016  + num ADD    0015  +1            / method foo
0017    num USTORE 0014  0016         /
0018  + int ADD    0003  +1           i = i + 1
0019 >  int LE     0018  0001         i < loop bound?
0020 ------ LOOP ------------
0021 >  int ABC    0005  0018         array[i] bounds check
0022    ptr AREF   0007  0018         \
0023 >  tab ALOAD  0022               /  obj = array[i]
0024    ptr HREF   0023  0010         \
0025 >  fun HLOAD  0024               /  method = obj[msg]
0026 >  fun FRAME  0025  test.lua:9   specialize to method foo
0027  + num ADD    0016  +1           \
0028    num USTORE 0014  0027         / inlined method foo
0029  + int ADD    0018  +1           i = i + 1
0030 >  int LE     0029  0001         i < loop bound?
0031    int PHI    0018  0029
0032    num PHI    0016  0027

> 1. The dynamic dispatch is overall a killer for cross-routine
> optimization, but can parts of the method lookup be hoisted based
> on the fact that the key isn't changing within the loop? We still
> have to do the method lookup per object, but I would think that
> parts of it might be amenable to moving outside the loop.

The method lookup (0011/0012) is dependent on an invariant part
(0010 msg) and a variant part (0009 obj). This means it must be
variant and cannot be hoisted. It's duplicated inside the loop
(0024/0025).

The following method dispatch is specialized to the receiver
(0013), which allows inlining of the method (0014-0017). But the
specialization itself cannot be hoisted (it's duplicated at 0026).

Often parts of the inlined method can still be hoisted (here only
0027/0028 remain inside the loop).

Since the method specialization cannot be hoisted, this leads to a
megamorphic dispatch inside the loop as soon as you use 'broadcast'
to send different messages. And there's nothing the compiler can do
about it due to the way the abstraction is written.

The Lua way to solve this is to use memoized code generation keyed
on the message. Table lookups with constant keys are faster, too.
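
A minimal sketch of what that could look like, assuming Lua
5.1-style loadstring() as used by LuaJIT ('get_broadcast' and
'broadcasters' are made-up names, and 'msg' must be a valid
identifier):

  local broadcasters = {}
  local function get_broadcast(msg)
    local f = broadcasters[msg]
    if not f then
      -- Generate a loop specialized to this message. The lookup
      -- obj.<msg> now has a constant key.
      f = assert(loadstring(
        "return function(array)\n"..
        "  for i=1,#array do\n"..
        "    local obj = array[i]\n"..
        "    obj."..msg.."(obj)\n"..
        "  end\n"..
        "end"))()
      broadcasters[msg] = f
    end
    return f
  end

  get_broadcast("foo")(t)  -- one specialized loop per message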

If you pass arrays of homogeneous objects, you should tell the
compiler about it. It cannot infer that. You have to manually hoist
the method lookup out of the loop.
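
For example (a sketch; 'broadcast_homogeneous' is a made-up name,
and it assumes every object in the array implements 'msg' the same
way):

  local function broadcast_homogeneous(array, msg)
    local n = #array
    if n == 0 then return end
    local method = array[1][msg]  -- lookup hoisted out of the loop
    for i=1,n do
      method(array[i])
    end
  end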

> 2. Are there optimizations available for passing varargs of
> particular lengths to the next level -- particularly in the 0 and
> 1 cases? Beyond that I would expect a simple loop copying the
> values from stack frame to stack frame to probably win, but I
> could imagine some opportunities with respect to the probably not
> uncommon cases where the varargs list is short or empty.

If the receiver is a fixarg function, it would necessarily
specialize to the number of arguments. As I said, this is the easy
case for varargs.
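
For illustration, splitting the common 0- and 1-argument cases
into fixarg entry points might look like this (hypothetical names,
following the question above):

  function broadcast0(array, msg)       -- no extra arguments
    for i=1,#array do
      local obj = array[i]
      obj[msg](obj)
    end
  end

  function broadcast1(array, msg, arg)  -- one extra argument
    for i=1,#array do
      local obj = array[i]
      obj[msg](obj, arg)
    end
  end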

The above argument about the megamorphic dispatch is completely
orthogonal to the vararg issue. The 'broadcast' abstraction is at
fault; the vararg problem is a side issue.

> 3. The array iteration itself is an obvious opportunity for some
> optimization, though would it be better to write it as an ipairs
> call?

As you can see, the compiler can hoist quite a bit of that. Do not
use an iterator when you can use a plain 'for' loop. The latter is
much easier for the compiler to optimize.
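
That is, for hole-free arrays these two loops are equivalent, but
the first is friendlier to the compiler:

  -- plain numeric 'for' loop (preferred):
  for i=1,#array do
    local obj = array[i]
    obj[msg](obj)
  end

  -- iterator version (harder to optimize):
  for _, obj in ipairs(array) do
    obj[msg](obj)
  end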

> 4. If we keep broadcasting the same message, are there
> opportunities to optimize based on that? That risks explosive
> growth if we don't routinely use the same message, so are there
> ways that the code could be structured to indicate that
> specialization based on message is likely to be useful?

Use memoized code generation, see above.

> In particular, consider the case where we are sending a
> particular message with 0 or 1 arguments and most objects will
> have the same implementation for that message. Can a tracing JIT
> pick this up?

Yes, it will automatically do this. But as you can see from above:
if you use inhomogeneous objects, the method lookup cannot be
hoisted. C++ would use a vtable dispatch in this case.

> Are there ways to help it pick it up? For example, if I've got a
> recompute graph, I might want to do something like:
> 
> 	broadcast( self.dependents, "markDirty" )
> 
> markDirty may be dynamic to allow for overrides, but most of the
> time it will probably execute fairly standard behavior. It would
> presumably be good to have this result in optimized code for the
> common case.

Performance of dynamic dispatch depends on the predictability of
the dispatch branch. If the dominant receiver is selected >90% of
the time, it's probably ok. If 10 receivers are selected with 10%
probability each, consider a redesign.

--Mike