I tested it with 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)'.

My disassembly doesn't look like yours (I used `gcc -g -O2 -c lvm.s` and `objdump -j .text -S lvm.o`).

the old setobj ------------------------>
    vmfetch();
    vmdispatch (GET_OPCODE(i)) {
      vmcase(OP_MOVE) {
        setobjs2s(L, ra, RB(i));
    1f50: 41 c1 ed 17           shr    $0x17,%r13d
    1f54: 49 c1 e5 04           shl    $0x4,%r13
    1f58: 4b 8b 04 2f           mov    (%r15,%r13,1),%rax
    1f5c: 4b 8b 54 2f 08       mov    0x8(%r15,%r13,1),%rdx
    1f61: 48 89 03             mov    %rax,(%rbx)
    1f64: 48 89 53 08           mov    %rdx,0x8(%rbx)
    1f68: 48 8b 75 28           mov    0x28(%rbp),%rsi
        vmbreak;
    1f6c: e9 6f f3 ff ff       jmpq   12e0 <luaV_execute+0x40>
    1f71: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
        vmbreak;
      }
the new one<---------------------------------------------------------------
    vmfetch();
    vmdispatch (GET_OPCODE(i)) {
      vmcase(OP_MOVE) {
        setobjs2s(L, ra, RB(i));
    1f08: 41 c1 ed 17           shr    $0x17,%r13d
    1f0c: 49 c1 e5 04           shl    $0x4,%r13
    1f10: 4d 01 fd             add    %r15,%r13
        vmbreak;
      }
      vmcase(OP_LOADK) {
        TValue *rb = k + GETARG_Bx(i);
        setobj2s(L, ra, rb);
    1f13: 49 8b 45 00           mov    0x0(%r13),%rax
    1f17: 48 89 03             mov    %rax,(%rbx)
    1f1a: 41 8b 45 08           mov    0x8(%r13),%eax
    1f1e: 89 43 08             mov    %eax,0x8(%rbx)
    1f21: 48 8b 45 28           mov    0x28(%rbp),%rax
        vmbreak;
    1f25: e9 a6 f3 ff ff       jmpq   12d0 <luaV_execute+0x40>
    1f2a: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
        vmbreak;
      }
----------------------------------------------------------------------------------

In the disassembly, the differences are:

1. `mov    (%r15,%r13,1),%rax` vs. `add    %r15,%r13` followed by `mov    0x0(%r13),%rax`
2. The old setobj does load, load, store, store; the new setobj does load, store, load, store.
3. The register size differs when assigning tt_: the old code copies it with a 64-bit move (%rdx), the new code with a 32-bit move (%eax).
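
To make points 2 and 3 concrete, here is a minimal C sketch of the two copy styles. It assumes the old setobj is a whole-struct assignment and the new one copies the value and the tag field by field. The `Value`/`TValue` definitions are simplified stand-ins rather than the real ones from lobject.h, and `setobj_old`, `setobj_new` and `copies_agree` are hypothetical names used only for illustration.

```c
#include <string.h>

/* Simplified stand-ins for Lua's Value and TValue (assumption:
   an 8-byte payload plus an int tag, padded to 16 bytes on x86-64). */
typedef union {
  void *p;
  double n;
  long long i;
} Value;

typedef struct {
  Value value_;  /* 8-byte payload */
  int tt_;       /* 4-byte type tag; the remaining 4 bytes are padding */
} TValue;

/* Old style: whole-struct assignment. The compiler may copy all 16
   bytes, padding included, e.g. as two 64-bit loads and two stores. */
void setobj_old(TValue *dst, const TValue *src) {
  *dst = *src;
}

/* New style: field-by-field copy. Only the 8-byte value and the
   4-byte tag need to move, so the tag can use a 32-bit register. */
void setobj_new(TValue *dst, const TValue *src) {
  dst->value_ = src->value_;
  dst->tt_ = src->tt_;
}

/* Sanity check: both variants transfer the same payload and tag.
   Returns 1 when they agree, 0 otherwise. */
int copies_agree(void) {
  TValue src, a, b;
  memset(&src, 0, sizeof src);
  src.value_.i = 42;
  src.tt_ = 19;  /* arbitrary tag value, for illustration only */
  setobj_old(&a, &src);
  setobj_new(&b, &src);
  return a.value_.i == b.value_.i && a.tt_ == b.tt_;
}
```

Compiling this file on its own with `gcc -O2 -S` and comparing the two functions should show the difference in move widths, although the exact instruction order depends on the compiler version and flags.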

However, it seems that no matter what code the compiler generates, the new setobj is always faster than the old one.

Maybe the compiler is trying to hint at something?

By the way, can you tell me which `performance monitors` you used?
 


On Sat, Jun 15, 2019 at 2:17 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>> ">" == 重归混沌  <findstrx@gmail.com> writes:

 >>  I test it in lua5.3.4 source code.

Aha.

What compiler were you using? Because I just tried it with clang 8.0.0
and gcc8, and the results are pretty fascinating (but quite unexpected).
The newer version of setobj is _much_ faster for me on that test code (I
increased the iteration count to 512*1024*1024, and the timings are 3.0
seconds vs. 4.1 seconds(!)), but for reasons that seem to have basically
nothing to do with the code.

The test program compiles into bytecode that alternately executes
OP_MOVE and OP_FORLOOP. OP_MOVE is this, in the source code:

      vmcase(OP_MOVE) {
        setobjs2s(L, ra, RB(i));
        vmbreak;
      }

I'm getting this compiled code with the original setobj, using clang8
with -O2 -march=core2:

803           vmcase(OP_MOVE) {
804             setobjs2s(L, ra, RB(i));
   0x0000000000417390 <+256>:   shr    $0x13,%r11
   0x0000000000417394 <+260>:   and    $0xfffffff0,%r11d
   0x0000000000417398 <+264>:   mov    (%r9,%r11,1),%rax
   0x000000000041739c <+268>:   mov    0x8(%r9,%r11,1),%rcx
   0x00000000004173a1 <+273>:   jmpq   0x417b2c <luaV_execute+2204>

[...]

842           vmcase(OP_GETTABLE) {

[...]
   0x0000000000417b2c <+2204>:  mov    %rcx,0x8(%r12)
   0x0000000000417b31 <+2209>:  mov    %rax,(%r12)
   0x0000000000417b35 <+2213>:  jmpq   0x417330 <luaV_execute+160>
   0x0000000000417b3a <+2218>:  mov    %rbx,-0x68(%rbp)

1019            vmbreak;

In other words, the compiler is splitting up the load and the store in
OP_MOVE's setobj and implementing the store part by jumping to identical
code at the end of OP_GETTABLE. The new setobj on the other hand gives:

803           vmcase(OP_MOVE) {
804             setobjs2s(L, ra, RB(i));
   0x0000000000417160 <+256>:   shr    $0x13,%r10
   0x0000000000417164 <+260>:   and    $0xfffffff0,%r10d
   0x0000000000417168 <+264>:   mov    (%r8,%r10,1),%rax
   0x000000000041716c <+268>:   mov    %rax,(%r12)
   0x0000000000417170 <+272>:   mov    0x8(%r8,%r10,1),%eax
   0x0000000000417175 <+277>:   mov    %eax,0x8(%r8,%rbx,1)
   0x000000000041717a <+282>:   jmp    0x417100 <luaV_execute+160>

so there's no intermediate jump, just loads and stores.

You wouldn't expect one unconditional jump in a sequence like this to
make a huge difference, but analyzing the code with performance monitors
shows a very large difference in the number of resource stalls, and the
unconditional jmpq   0x417b2c <luaV_execute+2204> is the primary hotspot
for them.

--
Andrew.