lua-users home
lua-l archive


Nicholas Nethercote wrote:
> - The quoted paragraph above mentions "skip-list chains" but AFAICT the code
>   is just using a normal list, not a skip-list
>   Have I misunderstood something?

Well, originally it was a full skip list implementation. Now it's
just a degenerate form of it -- the additional complexity didn't
pay off. I kept the name because there's no nice name for a set of
interwoven, segregated, singly-linked lists.

The main point here is that the list is segregated per IR opcode:
there is one anchor per opcode. So if you're CSEing instruction
FOO(a, b) you only need to look in the FOO chain, which is just a
fraction of the IR. And you don't need to match the opcode type;
it's sufficient to check for a match with the operands (a, b),
which can be done with a single 2*16 bit = 32 bit compare.
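As a sketch in C (names, sizes and layout are illustrative, not LuaJIT's actual declarations), the segregated chains and the packed operand compare might look like this:

```c
#include <stdint.h>

/* Simplified sketch of segregated CSE chains. Each IR instruction is
** 64 bits; the prev field links it to the previous instruction with
** the same opcode. The op12 view assumes a little-endian layout. */
typedef union IRIns {
  struct { uint16_t op1, op2, ot, prev; };
  struct { uint32_t op12, otprev; };
} IRIns;

enum { IR_ADD, IR_SUB, IR_MUL, IR__MAX };  /* illustrative opcodes */

typedef struct JitState {
  IRIns ir[4096];
  uint16_t nins;           /* next free instruction slot (starts at 1) */
  uint16_t chain[IR__MAX]; /* one chain anchor per opcode; 0 = empty */
} JitState;

/* CSE: scan only the chain for this opcode; the operands match with
** a single 32-bit compare of the packed op12 field. */
static uint16_t cse(JitState *J, uint16_t op, uint16_t op1, uint16_t op2)
{
  uint32_t op12 = (uint32_t)op1 | ((uint32_t)op2 << 16);
  uint16_t ref = J->chain[op];
  while (ref) {
    if (J->ir[ref].op12 == op12)
      return ref;                 /* reuse the identical instruction */
    ref = J->ir[ref].prev;        /* next older instr, same opcode */
  }
  ref = J->nins++;                /* no match: emit a new instruction */
  J->ir[ref].op12 = op12;
  J->ir[ref].ot = op;
  J->ir[ref].prev = J->chain[op]; /* link into the opcode's chain */
  J->chain[op] = ref;
  return ref;
}
```

Note there is no hash function and no table to grow or invalidate; insertion is just pushing onto the front of the opcode's chain.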

This structure is helpful for more than CSE. It's already in the
right order for load/store forwarding and dead-store elimination.
Due to the special layout of the IR for loops, this works even
across loop iteration boundaries.

LuaJIT's IR is medium-level, so it uses different instructions for
tasks that would conceptually be the same at a lower level. E.g.
there are many different loads (SLOAD, ALOAD, HLOAD, FLOAD, ...)
in the IR -- the backend treats most of these exactly the same.

But keeping these separate helps CSE, load/store forwarding and
makes many kinds of analysis very cheap. E.g. an ALOAD (array
load) does not need to be CSEd against an HLOAD (hash table load).
And an ASTORE (array store) cannot possibly alias with an HLOAD.
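A minimal sketch of that disjointness check, with made-up opcode tags: because loads and stores are segregated by kind, cross-kind pairs are known disjoint up front, and only same-kind pairs need a real alias check on their operands.

```c
/* Illustrative opcode tags (not LuaJIT's actual enum). An array store
** touches the array part of a table, a hash load touches the hash
** part, so they can never alias -- no analysis needed. */
enum { IR_ALOAD, IR_HLOAD, IR_ASTORE, IR_HSTORE };

static int store_may_alias_load(int store_op, int load_op)
{
  return (store_op == IR_ASTORE && load_op == IR_ALOAD) ||  /* same kind */
         (store_op == IR_HSTORE && load_op == IR_HLOAD);    /* same kind */
}
```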

The IR has been specifically designed to preserve the higher-level
semantics of the language and the VM implementation. This pays off
a lot for e.g. alias analysis, which would be either very weak or
would get rather tricky with a low-level IR.

[Lines of code for alias analysis: GCC=10000, LLVM=3000, LuaJIT=80]

E.g. it's dead simple for LuaJIT to hoist the two hash lookups
(for _G["math"] and math["sqrt"]) out of this loop, even though
there's an intervening array store with a non-constant index:

  local t={}; for i=1,100 do t[i] = math.sqrt(i) end

> - I don't understand the "while (ref > lim)" condition, it looks like the
>   list search is terminated early in some cases but I don't understand what
>   those cases are... the IR is very dense and pithy code like that is hard
>   for newcomers to understand.

Well, that's the real secret behind it. :-)

The IR is in strict SSA form. Any reference operand of an
instruction must point to an instruction which strictly precedes
it in the instruction stream. So if you're searching for FOO(a, b)
in the chain for FOO you can stop when you reach lim = max(a, b).
An instruction at or before lim cannot possibly contain references
to either a or b.
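In code, the bound is just the loop condition itself. This is a sketch modeled on the description above, not the verbatim LuaJIT source:

```c
#include <stdint.h>

/* CSE lookup with the SSA search bound. In strict SSA, instruction i
** only has operand refs < i, so any instruction at or below
** lim = max(op1, op2) cannot reference the newer of the two operands
** and thus cannot match -- the chain walk can stop there. */
typedef struct IRIns { uint16_t op1, op2, prev; } IRIns;

static uint16_t cse_lookup(const IRIns *ir, uint16_t chain_anchor,
                           uint16_t op1, uint16_t op2)
{
  uint16_t lim = op1 > op2 ? op1 : op2;  /* search limit */
  uint16_t ref = chain_anchor;           /* newest instr with this opcode */
  while (ref > lim) {                    /* stop early thanks to SSA order */
    if (ir[ref].op1 == op1 && ir[ref].op2 == op2)
      return ref;                        /* match found */
    ref = ir[ref].prev;                  /* next older instr, same opcode */
  }
  return 0;  /* no match: caller emits a new instruction */
}
```

When the chain anchor itself is at or below the limit, the loop body never runs at all, which is the common case mentioned below.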

Also it's well known that SSA has a high locality of reference. At
least one of the reference operands of an instruction is usually
nearby. This imposes a very strict limit for the length of the
search. In practice the average number of chain lookups for CSE is
_below_ 1. Most searches fail at the first limit check, before
even entering the loop.

> - I tried implementing a simple list in Nanojit (used by Mozilla and Adobe),
>   which currently uses a hash table for CSE.  I tried it only on 32-bit
>   immediates, and the performance was awful -- there can be dozens of
>   different ones in a single block so searching through them linearly was
>   much worse than doing a hashtable lookup, which usually hits in about 1.5
>   lookups (albeit using a not-cheap hash function).  But the quoted
>   paragraph above says "average overhead is less than 1 lookup".  What am I
>   missing?

You have to account for the insertion costs, too. Managing a hash
table is not free (it has to grow, it needs invalidations etc.).
And a hash table doesn't give you an ordered lookup for free
(needed for load/store forwarding). Linked lists also allow you to
do cheap, ordered, cross-instruction checks (e.g. see cse_href).

Also, LuaJIT's IR is not a low-level IR. There's much less need
for having field offsets etc. as constants. I have rarely seen a
trace with more than a handful of integer constants.

Most field offsets etc. are encoded as implicit or explicit
literals in the instructions themselves. E.g. SLOAD #1 is a load
from stack slot number 1 and the 1 is a literal, encoded in one of
the operands.
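As a sketch with illustrative field names, emitting such an instruction just writes the literal into the operand field; nothing has to be interned or looked up:

```c
#include <stdint.h>

/* Literal operand encoding. For SLOAD #1 the slot number is stored
** directly in op1 as a small literal -- it is not a reference to a
** separately interned constant. Field names are illustrative. */
typedef struct IRIns { uint16_t op1, op2, ot, prev; } IRIns;  /* 64 bits */

enum { IR_SLOAD = 1 };

static IRIns emit_sload(uint16_t slot, uint16_t flags)
{
  IRIns ins = {0};
  ins.ot = IR_SLOAD;
  ins.op1 = slot;   /* literal slot number, not an IR ref */
  ins.op2 = flags;  /* SLOAD variant flags */
  return ins;
}
```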

[BTW: Interning of integers is not done via CSE. It's in
lj_ir_kint(). But it's similar enough.]

And the IR is split into the constant and non-constant part to
improve cache locality. The constant part grows downwards from the
bias point and the instruction part grows upwards:

... -0002 -0001 REF_BIAS 0001 0002 0003 ...
 IR constants <--- | ---> IR instructions

This also simplifies checking operands for constness. Just compare
the reference itself against REF_BIAS. No need to dereference it.
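A sketch of that check (0x8000 is the value LuaJIT uses for REF_BIAS in lj_ir.h, but take the surrounding names as illustrative):

```c
#include <stdint.h>

/* REF_BIAS splits the reference space: constant refs sit below it,
** instruction refs at or above it. Constness of an operand is then a
** single compare on the reference itself -- no need to load the
** referenced instruction. */
#define REF_BIAS 0x8000u

static int irref_isk(uint32_t ref)
{
  return ref < REF_BIAS;
}
```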

What you see in the IR dumps (-jdump) is actually an amalgam of
both parts of the IR. Only the non-constant part is dumped,
referenced constants are embedded into the output. E.g. this:

$ luajit -jdump -e "for i=1,100 do end"
0001    int SLOAD  #1    I
0002  + int ADD    0001  +1  
0003 >  int LE     0002  +100
0004 ------ LOOP ------------
0005  + int ADD    0002  +1  
0006 >  int LE     0005  +100
0007    int PHI    0002  0005

Is really this (except it would be less readable):

-0005   int KINT   [     +100]
-0004   int KINT   [       +1]
0001    int SLOAD  #1    I
0002  + int ADD    0001  -0004
0003 >  int LE     0002  -0005
0004 ------ LOOP ------------
0005  + int ADD    0002  -0004
0006 >  int LE     0005  -0005
0007    int PHI    0002  0005

As you can see, all constants are close together. LuaJIT's IR is
very compact (64 bit per instruction). Doing a lookup in the KINT
chain is more or less a linear lookup in the same cache line, i.e.
basically free.