I was having some trouble understanding a microbenchmark result on Lua 5.1work6 (also noted by Mike Pall). [Note 1]

for i = 1, 1e8 do end
        4.09 real         4.07 user         0.00 sys

local t; for i = 1, 1e8 do end
        1.16 real         1.14 user         0.01 sys

The difference is the VM op that is executed 100 million times (full disassembly is at the end of this message):
  FORLOOP 1, -1  (fast)    FORLOOP 0, -1   (slow)

This didn't appear to happen on 5.0.2, but I eventually managed to reproduce it there too; it just happens at a different point on the stack. By varying the number of locals created (and thus the stack index of the FORLOOP operation) from 0 to 64, I observed that the timing has a periodicity of 16. I constructed the following table by timing 10^9 iterations and multiplying by the processor speed in GHz to get the approximate number of cycles; it might be out by one or two, but the timings are quite repeatable. [Note 2]


dstreg  5.1w6   5.0.2
------  -----   -----
     0    113     38
     1     32     38
     2     32     38
     3     32     38
     4     32     38
     5     32     64
     6     32     56
     7     32    111
     8     32     38
     9     32     38
    10     32     38
    11     32     38
    12     32     38
    13    113     38
    14     48     38
    15     39     38
<repeats, mod 16>
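
For concreteness, here is a self-contained sketch of the kind of measurement that produces the table; the original used the /usr/bin/time harness from [Note 1], so the structure and names here are mine, not the original script:

  -- A sketch, not the original script: pad the stack with n locals so
  -- the FORLOOP base register becomes n, time 1e9 empty iterations, and
  -- convert seconds to cycles at 2.8 GHz (the CPU from [Note 2]).
  local GHZ = 2.8
  for n = 0, 64 do
    local pad = {}
    for i = 1, n do pad[i] = "local l" .. i .. "; " end
    local chunk = table.concat(pad) .. "for i = 1, 1e9 do end"
    local f = assert(loadstring(chunk))   -- loadstring: Lua 5.0/5.1
    local t0 = os.clock()
    f()
    -- seconds * 2.8e9 Hz / 1e9 iterations = cycles per iteration
    print(n, (os.clock() - t0) * GHZ)
  end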

Now, the L1 cache of a Pentium 4 is effectively organized in units of 64 bytes. Consequently, every 16th stack slot contains a (potential) double which crosses a cache-line boundary. The for loop in 5.0.2 reads three consecutive stack slots and writes to the first one; the for loop in 5.1w6 reads three consecutive stack slots, writes to the first one, and then writes to another slot. The above timings are completely consistent with that pattern of accesses if the Pentium 4 pays a penalty for reading a misaligned double which crosses a cache line, and a huge penalty for writing one.
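
To see where the period of 16 comes from, assume 12-byte stack slots: an 8-byte double plus a 4-byte type tag, which is typical of a 32-bit build (the slot size is my assumption, not something I measured). Then the arithmetic works out as follows:

  -- Assumed sizes: 12-byte stack slot, 8-byte double, 64-byte cache line.
  local SLOT, DOUBLE, LINE = 12, 8, 64
  for slot = 0, 15 do
    local off = (slot * SLOT) % LINE         -- double's offset within its line
    print(slot, off, off + DOUBLE > LINE)    -- true => straddles a line boundary
  end
  -- 16 slots * 12 bytes = 192 bytes = exactly 3 cache lines, so the
  -- crossing pattern repeats with period 16. Which slot index straddles
  -- depends on where the stack itself starts, hence 5.0.2 and 5.1w6
  -- showing the slowdown at different indices.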

Armed with that thought and google, I found the following interesting reference:

<http://www.cygnus-software.com/papers/x86andinfinity.html>

in which you will find (in the reference to "test results") the following interesting chart, which I've abbreviated quite a bit: [Note 3]

Identifier               count      time    Mops/sec  Cycles/op  Slowdown
       Loads and stores - alignment tests
dword aligned src    ,  50000000,  0.04021, 1243.500,    2.252,    1.018
cache unaligned src  ,   2000000,  0.01441,  138.763,   20.178,    9.121
dword aligned dst    ,  50000000,  0.03964, 1261.332,    2.220,    1.003
cache unaligned dst  ,   2000000,  0.05760,   34.725,   80.634,   36.447

And, interestingly, the 20-cycle penalty for cache-unaligned loads and the 80-cycle penalty for cache-unaligned stores correspond amazingly closely to the observed data: for example, baseline plus store penalty, 32 + 80 = 112 cycles, is almost exactly the 113 cycles measured for the worst-case slot in 5.1w6.

Even more interesting was the fact that AMD Athlon chips don't suffer from this problem. (I just mention that in passing, but it may influence my future buying decisions :)


[Note 1]
The output comes from the following script:

echo $1
/usr/bin/time ~/src/lua-5.1work5/src/lua -e "$1"
echo

[Note 2]
The timings were done on a Pentium 4 2.8 GHz, with both versions of Lua compiled with gcc version 3.3.3 with the options:

  -O3 -fno-crossjumping -fomit-frame-pointer -march=i686

which is so far what has given me the best results on that CPU.

[Note 3]

The referenced paper is mostly about infinities, NaNs and denormalized numbers, and if you use those with any frequency, you'll find the results even more, ummm, interesting. 861 cycles to add two infinities seems just a tad excessive.

I tested both comparison with infinity and addition of infinity in Lua; it appears that there is no similar penalty for comparing with infinity (phew! I use that idiom), but the penalty for adding infinity seems to be pretty much as described. (See the third test)

local t; local inf=1/0; for j = 1, 1e8 do local i = inf + j end
       41.49 real        41.30 user         0.07 sys
local t; local inf=1e8; for j = 1, 1e8 do local i = inf + j end
        2.14 real         2.12 user         0.01 sys
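
The comparison test itself isn't shown above; presumably it was a one-liner in the same style, something like this (my reconstruction, not the original test):

local t; local inf=1/0; for j = 1, 1e8 do local b = j < inf end  -- hypothetical reconstruction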

In other words, based on the above result that the for loop itself takes 31 cycles, it would appear that an addition takes about 28 cycles, unless it happens to be cache-unaligned, in which case it takes 48 cycles, or unless it happens to be an addition of infinity, in which case it takes somewhat over 1000 cycles.
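
The cycle figures fall out of the timings like this (my arithmetic, using the user times above, 1e8 iterations at the 2.8 GHz of [Note 2]):

  -- cycles per iteration = user seconds * 2.8e9 Hz / 1e8 iterations
  print(2.12  * 2.8e9 / 1e8)   -- ~59; minus ~31 for the loop leaves ~28 per addition
  print(41.30 * 2.8e9 / 1e8)   -- ~1156; the addition of infinity costs well over 1000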

[Note 4]
I also ran these tests with 5.1work6 on Mac OS X with a 1 GHz G4. It exhibited neither of the variations: the for-loop test was consistent regardless of stack index, and the addition test did not appear to be affected by the value being added. An empty numeric for loop appears to execute in about 39 cycles; the addition, whether of infinity or of 1e8, takes about 29 additional cycles. Of course, on the PowerPC the stack is double-word aligned, so the for-loop test is not a fair comparison.
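
As a rough sanity check on that figure (my arithmetic):

  -- 39 cycles/iteration at 1 GHz over 1e8 iterations:
  print(39 * 1e8 / 1e9)   -- about 3.9 seconds for the empty loop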

[Luac disassembly of initial test programs]
luac output: for i = 1, 1e8 do end
        1       [1]     LOADK           0 -1    ; 1
        2       [1]     LOADK           1 -2    ; 100000000
        3       [1]     LOADK           2 -1    ; 1
        4       [1]     FORPREP         0 0     ; to 5
        5       [1]     FORLOOP         0 -1    ; to 5
        6       [1]     RETURN          0 1

luac output: local t; for i = 1, 1e8 do end
        1       [1]     LOADNIL         0 0
        2       [1]     LOADK           1 -1    ; 1
        3       [1]     LOADK           2 -2    ; 100000000
        4       [1]     LOADK           3 -1    ; 1
        5       [1]     FORPREP         1 0     ; to 6
        6       [1]     FORLOOP         1 -1    ; to 6
        7       [1]     RETURN          0 1