lua-l archive

>> You'll quickly conclude this could be optimized down to an empty,
>> infinite loop, right? Now try to apply the classic compiler
>> optimizations you've learned back then in that compiler course:

My compiler course taught me to peel my loops :-)

--
e.v.e


On Mon, Aug 30, 2010 at 3:12 AM, Mike Pall <mikelu-1008@mike.de> wrote:
Miles Bader wrote:
> On Mon, Aug 30, 2010 at 8:01 AM, Jean-Yves F. Barbier <12ukwn@gmail.com> wrote:
> > So, AFAI understand, a JIT compiler takes advantage of the context by
> > compiling to machine code only what is needed, opposed to a conventional
> > compiler that produce highly optimized code but as a whole?
>
> Yes, and in some cases that strategy can lead to better code than a
> conventional compiler, because the JIT compiler (even if it's
> "simpler"), can produce code optimized exactly for individual hot
> paths through the code.

Yes, this both saves compile time and often produces better code.
E.g. JVM hotspot only compiles hot methods. A trace compiler,
such as LuaJIT, works at a finer granularity and only compiles
hot paths through the bytecode (usually this means hot loops).

Actually, LuaJIT goes one step further: it breaks complex bytecodes
apart and records only the sub-paths actually taken. E.g. the semantics of
index expressions in Lua (value = obj[key]) are quite complex: all
kinds of Lua objects are allowed for both the object and the key,
one has to check for metatables and __index tables or metamethods,
tables are split into an array part and a hash part, and so on ...
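
To make that concrete, here is a small illustrative example: three
index expressions that all compile from the same bytecode, but hit
different cases of its semantics at runtime (the variable names are
made up):

```lua
local arr     = { 10, 20, 30 }           -- keys 1..3 live in the array part
local hash    = { answer = 42 }          -- string key lives in the hash part
local proxied = setmetatable({}, {
  __index = function(t, k) return "fallback" end  -- metamethod dispatch
})

print(arr[2])           --> 20        (array-part lookup)
print(hash.answer)      --> 42        (hash-part lookup)
print(proxied.missing)  --> fallback  (__index metamethod)
```

A trace through code like this records only the case that was
actually taken, so each lookup ends up as a guard plus a load
instead of the full dispatch logic.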

You really don't want to inline all of this logic, as it creates a
lot of bloat, which in turn makes it hard for the compiler to
analyze and optimize the resulting mess. For a typical array or
hash access, LuaJIT is able to optimize all this dynamism down to
just a couple of machine instructions. This would be nearly
impossible with a traditional compiler.

Here's a nice example which is rather tricky for most compilers,
but near trivial for a trace-based JIT compiler:

 local a = 1
 while true do
   if a ~= 1 then
     a = 2
   end
 end

You'll quickly conclude this could be optimized down to an empty,
infinite loop, right? Now try to apply the classic compiler
optimizations you've learned back then in that compiler course:

Let's start with constant propagation: well, too bad ... we cannot
prove that 'a' is always 1 inside the loop, because that annoying
'a = 2' is in the way.

Ok, so let's try dead-code elimination first. To get rid of the
part inside the 'if' we'll need to prove the condition is always
false. Alas, for that we'd first need to prove that 'a' is always
constant. Oops ... you see the problem now?

[
Yes, this can be solved in a traditional compiler with an
iterative approach, combining multiple optimizations. But it ain't
pretty. You can read more about this in this 157 page thesis from
Cliff Click (who worked on JVM hotspot later):

 Combining Analyses, Combining Optimizations, Cliff Click, 1995
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.8510

BTW#1: Yes, GCC handles this example.

BTW#2: Yes, there are other optimizations one could apply to
      infinite loops -- I wanted to simplify the discussion.
      So replace the 'true' with an iterator and let 'a' escape.
      The analysis is still the same.
]
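
One possible reading of the variant sketched in BTW#2, with the
'true' replaced by an iterator and 'a' escaping the loop (the
function name is made up):

```lua
-- 'a' escapes via the return value, so it cannot simply be deleted;
-- the analysis problem from above is unchanged.
local function f(t)
  local a = 1
  for _, v in ipairs(t) do
    if a ~= 1 then
      a = 2
    end
  end
  return a
end

print(f({ 5, 6, 7 }))  --> 1
```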

Let's see what a trace compiler does with this example:

The loop gets hot and the trace compiler starts recording there.
It first records the conditional (a ~= 1). Since the JIT compiler
is able to inspect the actual runtime values, it knows the
condition evaluates to false. So it emits a guard for the inverse
condition into the IR (intermediate representation). The 'then' part
is not executed and the next bytecode jumps unconditionally to the
start of the loop. This is the signal to stop trace recording.

Now the trace recorder has recorded exactly one iteration of the
loop: the SLOAD for 'a' and the EQ for the conditional guard (see
the IR below). After that, the loop optimization pass runs over the
IR and hoists out loop-invariant code. The SLOAD is loop-invariant,
as there's no corresponding store. This in turn makes the EQ
loop-invariant. Ok, that's it -- no instructions remain inside the
loop. Now, that was easy! :-)
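
In Lua-flavored pseudocode, the optimized trace amounts to
something like this (load_slot and side_exit are made-up names for
the hoisted IR operations):

```lua
-- Hoisted out of the loop: runs once, before entering the compiled loop.
local a = load_slot(1)             -- SLOAD #1 (with type check)
if a ~= 1 then side_exit(1) end    -- EQ guard; leaves the trace if it fails

-- The loop itself: nothing left to do.
while true do end
```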

Here's the output of: luajit -jdump test.lua

---- TRACE 1 start test.lua:2            <-- Start of trace at the loop start
0003  ISEQN    0   0      ; 1            <-- First recorded bytecode
0004  JMP      1 => 0002
0002  LOOP     1 => 0007                 <-- Start of loop (again), stop now
---- TRACE 1 IR
0001 >  num SLOAD  #1    T               <-- Recorded IR, hoisted
0002 >  num EQ     0001  +1
0003 ------ LOOP ------------            <-- Empty IR inside the loop!
---- TRACE 1 mcode 52
394cffbe  mov dword [0x41d574a8], 0x1    <-- Set trace number (for side exits)
394cffc9  movsd xmm1, [0x41d602e0]       <-- Load of constant 1
394cffd2  cmp dword [rdx+0x4], -0x0d     <-- Type check for 'a'
394cffd6  ja 0x394c0010 ->0
394cffdc  movsd xmm0, [rdx]              <-- Load of 'a'
394cffe0  ucomisd xmm0, xmm1             <-- Compare 'a' to 1
394cffe4  jpe 0x394c0014        ->1      <-- These side exits are never taken
394cffea  jnz 0x394c0014        ->1
->LOOP:
394cfff0  jmp 0x394cfff0        ->LOOP   <-- Really fast infinite loop ;-)
---- TRACE 1 stop -> loop

After that it will hang, but that's to be expected from an
infinite loop. Press ctrl-c twice.

BTW: The compiler never sees the 'local a = 1', but it really
doesn't need to. It simply guards that the conditional is always
false, i.e. that 'a' is always 1 (the runtime value it saw). So
the same logic works, even if 'a' is not a constant outside the
loop (now try that with a static compiler). If 'a' has a different
value before it reaches the loop, the trace would be extended with
a branch to a side trace (if this happens often enough).
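
For illustration, a hypothetical variant where 'a' reaches the loop
with different runtime values:

```lua
-- The first trace is recorded with whatever value of 'a' it happens
-- to see and guards on it. Calls with the other value hit the side
-- exit; if that happens often enough, a side trace gets compiled for
-- that path, too. (Hypothetical example.)
local function run(a, n)
  local hits = 0
  for i = 1, n do
    if a ~= 1 then
      hits = hits + 1
    end
  end
  return hits
end

print(run(1, 100))  --> 0    (guard 'a == 1' always holds)
print(run(2, 100))  --> 100  (guard fails; side exit / side trace)
```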

--Mike