Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

Yes, I think you are right. So the cost is a store-forwarding penalty.

I found the answer in <<64-ia-32-architectures-optimization-manual>>.

In the lua5.3.4 source code, the last statement in 'vmcase(OP_FORLOOP)' is 'setivalue(ra+3, idx)', and 'setivalue' uses two separate assignments (one for `value_` and one for `tt_`; `value_` is 64 bits and `tt_` is 32 bits).

Then the Lua VM executes `vmcase(OP_MOVE)`, which is `setobjs2s(L, ra, RB(i))`. setobjs2s also uses two assignments, but both of them move 64-bit data, so the 64-bit load that covers `tt_` is wider than the 32-bit store that just wrote it.
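
To make the two access patterns concrete, here is a minimal C sketch. `DemoValue`, `DemoTValue`, `DEMO_TAG_INT`, `demo_setivalue` and `demo_setobj` are made-up stand-ins, not the real definitions from the Lua source, and the layout assumes x86-64. The explicit two-word copy only mirrors the two 64-bit moves described above; a compiler is free to emit something else for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a Lua 5.3 TValue on x86-64:
   8-byte payload, 4-byte tag, 4 bytes of tail padding. */
typedef union { int64_t i; double n; void *p; } DemoValue;
typedef struct { DemoValue value_; int32_t tt_; } DemoTValue;

enum { DEMO_TAG_INT = 19 };  /* arbitrary tag value for this demo */

/* Layout assumptions of this sketch (they hold on x86-64). */
_Static_assert(sizeof(DemoTValue) == 16, "expected a 16-byte slot");
_Static_assert(offsetof(DemoTValue, tt_) == 8, "tag expected at offset 8");

/* Shaped like setivalue: one 64-bit store and one 32-bit store.
   The 4 padding bytes after tt_ are not written. */
static void demo_setivalue(DemoTValue *o, int64_t x) {
    assert(o != NULL);
    o->value_.i = x;        /* 8-byte store */
    o->tt_ = DEMO_TAG_INT;  /* 4-byte store */
}

/* Copies the slot as two 64-bit words, like the two moves described for
   setobjs2s. The second word covers tt_ plus the padding, so its load is
   wider than the 4-byte store made by demo_setivalue. */
static void demo_setobj(DemoTValue *dst, const DemoTValue *src) {
    uint64_t words[2];
    assert(dst != NULL && src != NULL);
    memcpy(words, src, sizeof words);  /* load 2 x 8 bytes */
    memcpy(dst, words, sizeof words);  /* store 2 x 8 bytes */
}
```

Calling `demo_setivalue` on a slot and then `demo_setobj` from that slot reproduces the order of operations in the loop: a 32-bit store to `tt_` followed by a 64-bit load over the same bytes.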

From the <<Intel® 64 and IA-32 Architectures Optimization Reference Manual>> => Chapter 3.6.5 => Figure 3-3 => Condition (b):

`Size of Load > Store` incurs a penalty.
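
A small C sketch of how one might compare the two cases is below. It is only an illustration under assumptions: `DemoSlot`, `store32_load64`, `store32_load32`, `DemoLoop` and `demo_seconds` are names I made up, the code assumes a little-endian x86-64 machine, and `volatile` is used so the compiler keeps each store and load at the stated width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical 8-byte slot, viewed as one 64-bit word or two 32-bit halves. */
typedef union { uint64_t word; uint32_t half[2]; } DemoSlot;

/* Condition (b): a 4-byte store, then an 8-byte load that overlaps it. */
static uint64_t store32_load64(volatile DemoSlot *s, uint32_t iters) {
    uint64_t sum = 0;
    assert(s != NULL);
    for (uint32_t k = 0; k < iters; k++) {
        s->half[0] = k;   /* 32-bit store */
        sum += s->word;   /* 64-bit load covering the bytes just stored */
    }
    return sum;
}

/* Control: a 4-byte store, then a 4-byte load of the same bytes. */
static uint64_t store32_load32(volatile DemoSlot *s, uint32_t iters) {
    uint64_t sum = 0;
    assert(s != NULL);
    for (uint32_t k = 0; k < iters; k++) {
        s->half[0] = k;      /* 32-bit store */
        sum += s->half[0];   /* 32-bit load of the same bytes */
    }
    return sum;
}

typedef uint64_t (*DemoLoop)(volatile DemoSlot *, uint32_t);

/* Runs one loop and returns the CPU time it took, in seconds. */
static double demo_seconds(DemoLoop fn, volatile DemoSlot *s,
                           uint32_t iters, uint64_t *sum_out) {
    clock_t start = clock();
    *sum_out = fn(s, iters);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

If condition (b) applies, I would expect `store32_load64` to take longer than `store32_load32` for the same iteration count. I have not measured this sketch, so treat that as a prediction, not a result.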

On Sun, Jun 16, 2019 at 3:03 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:

>>>>> ">" == 重归混沌 <findstrx@gmail.com> writes:

>> I modified the Lua test code, without touching the lua5.3.4 source code, and
>> setobj became faster.

>> Most likely the key point is the cache, but I found no answer in
>> <<64-ia-32-architectures-optimization-manual>>.

OK, I found out why this happens.

When the OP_FORLOOP code assigns a value to the visible loop variable,
it does so using two separate assignments (see the setivalue macro): one
to the value, one to the tag. The first is a 64-bit move, the second a
32-bit one. The 32-bit padding at the end of the TValue is not modified.

It turns out that when you write a 32-bit value, and then immediately
read the same location as a 64-bit or larger value, then this causes a
stall in the processor, even (apparently) if everything is hot enough to
already be in L1 cache. Presumably the pending store, which is on some
memory write pipeline, is treated as invalidating any memory fetch which
overlaps it.

If instead you write a 32-bit value and then immediately read it back
_as a 32-bit value_, then there is no stall, presumably because the
processor can fetch the whole value out of the write pipeline.

--
Andrew.