- Subject: Pentium 4 and misaligned doubles
- From: Rici Lake <lua@...>
- Date: Mon, 15 Aug 2005 23:47:15 -0500
I was having some trouble understanding a microbenchmark result on Lua
5.1work6 (also noted by Mike Pall) [Note 1]:
for i = 1, 1e8 do end
4.09 real 4.07 user 0.00 sys
local t; for i = 1, 1e8 do end
1.16 real 1.14 user 0.01 sys
The difference is the VM op that is executed 100 million times (full
disassembly is below):
FORLOOP 1, -1 (fast) FORLOOP 0, -1 (slow)
This didn't appear to happen on 5.0.2, but I eventually managed to
reproduce it; it just happens at a different point on the stack. By
varying the number of locals created (and thus the stack index of the
FORLOOP operation) from 0 to 64, I observed that the timing has a
periodicity of 16. I constructed the following table by timing 10^9
iterations and multiplying by the processor speed in GHz to get the
approximate number of cycles; it might be out by one or two, but the
timings are quite repeatable. [Note 2]
dstreg 5.1w6 5.0.2
------ ----- -----
0 113 38
1 32 38
2 32 38
3 32 38
4 32 38
5 32 64
6 32 56
7 32 111
8 32 38
9 32 38
10 32 38
11 32 38
12 32 38
13 113 38
14 48 38
15 39 38
<repeats, mod 16>
Now, the L1 cache of a Pentium 4 is effectively organized in units of
64 bytes. Consequently, every 16th stack slot contains a (potential)
double that crosses a cache-line boundary. The for loop in 5.0.2 reads
three consecutive stack slots and writes to the first one; the for
loop in 5.1w6 reads three consecutive stack slots, writes to the first
one, and then writes to the next consecutive slot. The above timings
are completely consistent with that pattern of accesses if the
Pentium 4 has a penalty for reading misaligned doubles that cross a
cache line, and a huge penalty for writing them.
Armed with that thought and Google, I found the following interesting
reference:
<http://www.cygnus-software.com/papers/x86andinfinity.html>
in which you will find (in the "test results" reference) the following
interesting chart, which I've abbreviated quite a bit: [Note 3]
Identifier             count       time     Mops/sec  Cycles/op  Slowdown
Loads and stores - alignment tests
dword aligned src    , 50000000,  0.04021,  1243.500,     2.252,    1.018
cache unaligned src  ,  2000000,  0.01441,   138.763,    20.178,    9.121
dword aligned dst    , 50000000,  0.03964,  1261.332,     2.220,    1.003
cache unaligned dst  ,  2000000,  0.05760,    34.725,    80.634,   36.447
And, interestingly, the 20-cycle penalty for cache-unaligned loads and
the 80-cycle penalty for cache-unaligned stores correspond amazingly
closely to the observed data.
Even more interesting is the fact that AMD Athlon chips don't suffer
from this problem. (I just mention that in passing, but it may
influence my future buying decisions. :)
[Note 1]
The output comes from the following script:
echo $1
/usr/bin/time ~/src/lua-5.1work5/src/lua -e "$1"
echo
[Note 2]
The timings were done on a Pentium 4 2.8 GHz, with both versions of Lua
compiled with gcc version 3.3.3 with the options:
-O3 -fno-crossjumping -fomit-frame-pointer -march=i686
which is so far what has given me the best results on that CPU.
[Note 3]
The referenced paper is mostly about infinities, NaNs and denormalized
numbers, and if you use those with much frequency, you'll find the
results even more, ummm, interesting. 861 cycles to add two infinities
seems just a tad excessive.
I tested both comparison with infinity and adding infinity in Lua; it
appears that there is no similar penalty to comparing with infinity
(phew! I use that idiom) but the penalty for adding infinity seems to
be pretty much as described. (See the third test)
local t; local inf=1/0; for j = 1, 1e8 do local i = inf + j end
41.49 real 41.30 user 0.07 sys
local t; local inf=1e8; for j = 1, 1e8 do local i = inf + j end
2.14 real 2.12 user 0.01 sys
In other words, based on the above result that the for loop itself
takes 31 cycles, it would appear that an addition takes about 28
cycles, unless it happens to be cache-unaligned, in which case it
takes 48 cycles, or unless it happens to be an addition of infinity,
in which case it takes somewhat over 1000 cycles.
[Note 4]
I also ran these tests with 5.1work6 on Mac OS X with a 1 GHz G4. It
exhibited neither of the variations; the for-loop test was consistent
regardless of stack index, and the addition test did not appear to be
affected by the sum. An empty numeric for loop appears to execute in
about 39 cycles; the addition, whether by infinity or by 1e8, takes
about 29 additional cycles. Of course, on the PowerPC the stack is
double-word aligned, so the for-loop test is not a fair comparison.
[Luac disassembly of initial test programs]
luac output: for i = 1, 1e8 do end
1 [1] LOADK 0 -1 ; 1
2 [1] LOADK 1 -2 ; 100000000
3 [1] LOADK 2 -1 ; 1
4 [1] FORPREP 0 0 ; to 5
5 [1] FORLOOP 0 -1 ; to 5
6 [1] RETURN 0 1
luac output: local t; for i = 1, 1e8 do end
1 [1] LOADNIL 0 0
2 [1] LOADK 1 -1 ; 1
3 [1] LOADK 2 -2 ; 100000000
4 [1] LOADK 3 -1 ; 1
5 [1] FORPREP 1 0 ; to 6
6 [1] FORLOOP 1 -1 ; to 6
7 [1] RETURN 0 1