Benjamin Segovia wrote:
> For various reasons (mostly because the machine I am simulating is a
> SIMD vector machine), I need to handle forward jumps with the idiom:
>
> while true do
> {... a lot of compute ...}
> if some_condition_on_the_lane_of_the_simd_vector then break end
> {... a lot of compute ...}
> break -- this one is to exit anyway
> end

Well, you're (ab)using a loop construct where you really want a
simple branch. The region selection heuristics of LuaJIT don't
appreciate it if you abuse control-flow constructs like that.

> Mostly, I think the error is "loop unroll limit reached"
>
> Strangely, there are no loops to unroll in my code.

Of course there's a loop, because you used a loop construct.
Whether that loop actually loops back is an orthogonal issue.

In fact, as you can see from the message, LuaJIT recognizes this
and unrolls the loop. But there has to be a total limit per trace
for the number of unrolls. See: http://luajit.org/running.html#opt_O
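
For reference, the limit in question is the "loopunroll" parameter
listed on that page (documented default: 15). A minimal sketch of
changing it from Lua follows; the value 30 is an arbitrary
illustration, not a recommendation, and raising the limit only hides
the mismatch between the code shape and the heuristics:

```lua
-- Sketch: raise LuaJIT's per-trace loop unroll limit.
-- The 'jit' table only exists under LuaJIT, so guard for plain Lua.
local applied = false
if jit and jit.opt then
  jit.opt.start("loopunroll=30")  -- same effect as: luajit -Oloopunroll=30
  applied = true
end
```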

So you should use an if/then/else construct, because that's what
it really is.
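
A minimal sketch of that rewrite, with made-up function names and
placeholder arithmetic standing in for the "lot of compute" blocks:

```lua
-- Loop construct used only as a forward jump (the idiom quoted above).
local function with_loop(x)
  local acc
  while true do
    acc = x * 2                 -- first block of compute
    if acc > 10 then break end  -- forward exit
    acc = acc + 1               -- second block of compute
    break                       -- exit anyway
  end
  return acc
end

-- Same control flow written as the branch it really is.
local function with_branch(x)
  local acc = x * 2             -- first block of compute
  if not (acc > 10) then        -- inverted exit condition
    acc = acc + 1               -- second block of compute
  end
  return acc
end
```

Both functions return the same value for every input; only the second
one avoids presenting a loop to the trace recorder.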

> It is more complicated to generate the code like that because there
> may be a lot of "break instructions" (the machine supports native
> structured branch instructions) and this will require _nesting_ more
> and more successive basic blocks into if statements.

I do not see how that's different from doing it with a while loop.
You'd need to nest them and close them with 'end' as well.
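
As a sketch of that correspondence, assume the generated code has one
loop region nested inside another (the function names and the
arithmetic are invented for illustration). Each while/end pair maps
to exactly one if/end pair:

```lua
-- Nested regions expressed with the while/break idiom.
local function nested_loops(x)
  local r
  while true do
    r = x + 1                    -- block A
    if r > 5 then break end      -- leave the outer region
    while true do
      r = r * 3                  -- block B
      if r > 9 then break end    -- leave the inner region
      r = r - 1                  -- block C
      break
    end
    r = r + 100                  -- block D
    break
  end
  return r
end

-- The same regions expressed with nested if statements.
local function nested_branches(x)
  local r = x + 1                -- block A
  if not (r > 5) then
    r = r * 3                    -- block B
    if not (r > 9) then
      r = r - 1                  -- block C
    end
    r = r + 100                  -- block D
  end
  return r
end
```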

> In C, you will just use goto. Here I really try to simulate a forward branch.

I might add goto to LuaJIT sometime (busy right now).

> Also, LuaJIT really behaves strangely. Because using a lot of branches
> to simulate forward jumps does not really seem to be the problem. The
> weird thing is that for some reason, if a sequence of _lua_ statements
> with no branch at all is too _big_ then no trace is compiled.

It needs to be _really_ big. The unroll limit hits first for your
example.

> If I add for loops for each statement (I do _not_ unroll the
> vector computations), then the for loops are compiled and
> performance is way better.

Then the inner loops are compiled first and stitched together with
side traces. That's less efficient, especially when the inner
loops have a low iteration count. It's still faster than if it
fails to compile anything at all. But you're missing out on the
maximum performance.

It works like this:

  Step 1       Step 2       Step 3       Step 4
                                         .---------.
                                         V         |
  Loop1<--.    Loop1<--.    Loop1<--.    Loop1<--. |
    `-----'      `-----'    | `-----'    | `-----' |
                            |            |         |
               Loop2<--.    Loop2<--.    Loop2<--. |
                 `-----'      `-----'    | `-----' |
                                         |         |
                                         `---------'

But you really want this:

  .--------.
  V        |
  Loop1[0] |
  Loop1[1] |
  Loop1[2] |
  |        |
  Loop2[0] |
  Loop2[1] |
  Loop2[2] |
  |        |
  `--------'
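
One way a code generator could aim for that shape is to emit the lane
operations already unrolled inside a single outer loop. The sketch
below assumes a hypothetical 4-lane machine and invented function
names; whether LuaJIT actually compiles it as one trace depends on
its heuristics and limits, which can be checked with -jv:

```lua
local LANES = 4  -- hypothetical vector width

-- One short inner loop per simulated vector statement.
local function step_loops(a, b, out, n)
  for _ = 1, n do
    for i = 1, LANES do out[i] = a[i] + b[i] end  -- "Loop1"
    for i = 1, LANES do out[i] = out[i] * 2 end   -- "Loop2"
  end
end

-- Same computation with the lanes unrolled by the generator.
local function step_unrolled(a, b, out, n)
  for _ = 1, n do
    out[1] = a[1] + b[1]   -- Loop1[0]
    out[2] = a[2] + b[2]   -- Loop1[1]
    out[3] = a[3] + b[3]   -- Loop1[2]
    out[4] = a[4] + b[4]   -- Loop1[3]
    out[1] = out[1] * 2    -- Loop2[0]
    out[2] = out[2] * 2    -- Loop2[1]
    out[3] = out[3] * 2    -- Loop2[2]
    out[4] = out[4] * 2    -- Loop2[3]
  end
end
```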

--Mike