I was having some trouble understanding a microbenchmark result on Lua 5.1work6 (also noted by Mike Pall). [Note 1]

for i = 1, 1e8 do end
        4.09 real         4.07 user         0.00 sys

local t; for i = 1, 1e8 do end
        1.16 real         1.14 user         0.01 sys

The difference is the VM op that is executed 100 million times (full disassembly is at the end of this message):
  FORLOOP 1, -1  (fast)    FORLOOP 0, -1   (slow)

This didn't appear to happen on 5.0.2, but I eventually managed to reproduce it there too; it just happens at a different point on the stack. By varying the number of locals created (and thus the stack index of the FORLOOP operation) from 0 to 64, I observed that the timing has a periodicity of 16. I constructed the following table by timing 10^9 iterations and multiplying by the processor speed in GHz to get the approximate number of cycles; it might be out by one or two, but the timings are quite repeatable. [Note 2]


dstreg  5.1w6   5.0.2
------  -----   -----
     0    113     38
     1     32     38
     2     32     38
     3     32     38
     4     32     38
     5     32     64
     6     32     56
     7     32    111
     8     32     38
     9     32     38
    10     32     38
    11     32     38
    12     32     38
    13    113     38
    14     48     38
    15     39     38
<repeats, mod 16>
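
For concreteness, here is a self-contained sketch of the kind of measurement that produces the table; the original used the /usr/bin/time harness from [Note 1], so the structure and names here are mine, not the original script:

  -- A sketch, not the original script: pad the stack with n locals so
  -- the FORLOOP base register becomes n, time 1e9 empty iterations, and
  -- convert seconds to cycles at 2.8 GHz (the CPU from [Note 2]).
  local GHZ = 2.8
  for n = 0, 64 do
    local pad = {}
    for i = 1, n do pad[i] = "local l" .. i .. "; " end
    local chunk = table.concat(pad) .. "for i = 1, 1e9 do end"
    local f = assert(loadstring(chunk))   -- loadstring: Lua 5.0/5.1
    local t0 = os.clock()
    f()
    -- seconds * 2.8e9 Hz / 1e9 iterations = cycles per iteration
    print(n, (os.clock() - t0) * GHZ)
  end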

Now, the L1 cache of a Pentium 4 is effectively organized in units of 64 bytes. Consequently, every 16th stack slot contains a (potential) double which crosses a cache-line boundary. The for loop in 5.0.2 reads three consecutive stack slots and writes to the first one; the for loop in 5.1w6 reads three consecutive stack slots, writes to the first one, and then writes to another slot. The above timings are completely consistent with that pattern of accesses if the Pentium 4 pays a penalty for reading a misaligned double which crosses a cache line, and a huge penalty for writing one.
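
To see where the period of 16 comes from, assume 12-byte stack slots: an 8-byte double plus a 4-byte type tag, which is typical of a 32-bit build (the slot size is my assumption, not something I measured). Then the arithmetic works out as follows:

  -- Assumed sizes: 12-byte stack slot, 8-byte double, 64-byte cache line.
  local SLOT, DOUBLE, LINE = 12, 8, 64
  for slot = 0, 15 do
    local off = (slot * SLOT) % LINE         -- double's offset within its line
    print(slot, off, off + DOUBLE > LINE)    -- true => straddles a line boundary
  end
  -- 16 slots * 12 bytes = 192 bytes = exactly 3 cache lines, so the
  -- crossing pattern repeats with period 16. Which slot index straddles
  -- depends on where the stack itself starts, hence 5.0.2 and 5.1w6
  -- showing the slowdown at different indices.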

Armed with that thought and google, I found the following interesting reference:

<http://www.cygnus-software.com/papers/x86andinfinity.html>

in which you will find (in the reference to "test results") the following interesting chart, which I've abbreviated quite a bit: [Note 3]

Identifier               count      time    Mops/sec  Cycles/op  Slowdown
       Loads and stores - alignment tests
dword aligned src    ,  50000000,  0.04021, 1243.500,    2.252,    1.018
cache unaligned src  ,   2000000,  0.01441,  138.763,   20.178,    9.121
dword aligned dst    ,  50000000,  0.03964, 1261.332,    2.220,    1.003
cache unaligned dst  ,   2000000,  0.05760,   34.725,   80.634,   36.447

And, interestingly, the 20-cycle penalty for cache-unaligned loads and the 80-cycle penalty for cache-unaligned stores correspond amazingly closely to the observed data: for example, baseline plus store penalty, 32 + 80 = 112 cycles, is almost exactly the 113 cycles measured for the worst-case slot in 5.1w6.

Even more interesting was the fact that AMD Athlon chips don't suffer from this problem. (I just mention that in passing, but it may influence my future buying decisions :)


[Note 1]
The output comes from the following script:

echo $1
/usr/bin/time ~/src/lua-5.1work5/src/lua -e "$1"
echo

[Note 2]
The timings were done on a Pentium 4 2.8 GHz, with both versions of Lua compiled with gcc version 3.3.3 with the options:

  -O3 -fno-crossjumping -fomit-frame-pointer -march=i686

which is so far what has given me the best results on that CPU.

[Note 3]

The referenced paper is mostly about infinities, NaNs and denormalized numbers, and if you use those with any frequency, you'll find the results even more, ummm, interesting. 861 cycles to add two infinities seems just a tad excessive.

I tested both comparison with infinity and addition of infinity in Lua; it appears that there is no similar penalty for comparing with infinity (phew! I use that idiom), but the penalty for adding infinity seems to be pretty much as described. (See the third test)

local t; local inf=1/0; for j = 1, 1e8 do local i = inf + j end
       41.49 real        41.30 user         0.07 sys
local t; local inf=1e8; for j = 1, 1e8 do local i = inf + j end
        2.14 real         2.12 user         0.01 sys
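
The comparison test itself isn't shown above; presumably it was a one-liner in the same style, something like this (my reconstruction, not the original test):

local t; local inf=1/0; for j = 1, 1e8 do local b = j < inf end  -- hypothetical reconstruction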

In other words, based on the above result that the for loop itself takes 31 cycles, it would appear that an addition takes about 28 cycles, unless it happens to be cache-unaligned, in which case it takes 48 cycles, or unless it happens to be an addition of infinity, in which case it takes somewhat over 1000 cycles.
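
The cycle figures fall out of the timings like this (my arithmetic, using the user times above, 1e8 iterations at the 2.8 GHz of [Note 2]):

  -- cycles per iteration = user seconds * 2.8e9 Hz / 1e8 iterations
  print(2.12  * 2.8e9 / 1e8)   -- ~59; minus ~31 for the loop leaves ~28 per addition
  print(41.30 * 2.8e9 / 1e8)   -- ~1156; the addition of infinity costs well over 1000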

[Note 4]
I also ran these tests with 5.1work6 on Mac OS X with a 1 GHz G4. It exhibited neither of the variations: the for-loop test was consistent regardless of stack index, and the addition test did not appear to be affected by the value being added. An empty numeric for loop appears to execute in about 39 cycles; the addition, whether of infinity or of 1e8, takes about 29 additional cycles. Of course, on the PowerPC the stack is double-word aligned, so the for-loop test is not a fair comparison.
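
As a rough sanity check on that figure (my arithmetic):

  -- 39 cycles/iteration at 1 GHz over 1e8 iterations:
  print(39 * 1e8 / 1e9)   -- about 3.9 seconds for the empty loop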

[Luac disassembly of initial test programs]
luac output: for i = 1, 1e8 do end
        1       [1]     LOADK           0 -1    ; 1
        2       [1]     LOADK           1 -2    ; 100000000
        3       [1]     LOADK           2 -1    ; 1
        4       [1]     FORPREP         0 0     ; to 5
        5       [1]     FORLOOP         0 -1    ; to 5
        6       [1]     RETURN          0 1

luac output: local t; for i = 1, 1e8 do end
        1       [1]     LOADNIL         0 0
        2       [1]     LOADK           1 -1    ; 1
        3       [1]     LOADK           2 -2    ; 100000000
        4       [1]     LOADK           3 -1    ; 1
        5       [1]     FORPREP         1 0     ; to 6
        6       [1]     FORLOOP         1 -1    ; to 6
        7       [1]     RETURN          0 1