SSE optimized memcpy not faster


From: Valerio Schiavoni
Sent: Tuesday, December 09, 2014 4:36 PM
To: Lua mailing list
Subject: Re: Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Hello Roberto,
thanks for your explanation.

On Tue, Dec 9, 2014 at 3:36 PM, Roberto Ierusalimschy wrote:
>> What is it happening that triggers that many '__memcpy_sse2_unaligned' ?
>
> If I understood the report correctly, there is no indication that there
> are too many '__memcpy_sse2_unaligned'; it is big only in comparison
> with the rest. If all your server does is to move data around (e.g.,
> it reads it from somewhere, creates a Lua string with it, and then writes
> it somewhere else),

Well, in my test-case, this is all the server does:

local data = clientsocket:receive(payload_size)

As you see, the data is read/received from a (non-blocking) LuaSocket
and then simply ignored until the end of the function.
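
Since the data is discarded anyway, one way to check how much of the time
goes into building a single large Lua string might be to read the payload
in smaller pieces. This is only a sketch: `drain` and its chunk size are
hypothetical names of mine, and it assumes the usual LuaSocket behaviour
where receive(n) returns either the data or nil, an error message, and
whatever partial data arrived before the error.

```lua
-- Hypothetical helper: read `total` bytes from `sock` in pieces of at most
-- `chunk` bytes and throw them away. Returns the number of bytes read,
-- plus an error message if the socket failed before `total` was reached.
local function drain(sock, total, chunk)
  local remaining = total
  while remaining > 0 do
    local want = math.min(chunk, remaining)
    -- Assumed LuaSocket convention: data on success,
    -- or nil, err, partial on failure or timeout.
    local data, err, partial = sock:receive(want)
    local got = data or partial or ""
    remaining = remaining - #got
    if not data and err ~= "timeout" then
      return total - remaining, err
    end
    -- On "timeout" a non-blocking socket simply has no more data yet;
    -- a real server would wait (e.g. with socket.select) instead of spinning.
  end
  return total
end

-- Intended use (not run here): drain(clientsocket, payload_size, 64 * 1024)
```

If the chunked version is clearly faster than the single receive call, the
large string construction would be a likely suspect; if not, the time is
probably spent elsewhere.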

On a 1 Gb/s network, this single call to receive takes an average of 5.3
seconds when the payload_size is big (128 MB).
Should I assume that it takes some time for the LuaSocket binding to copy
the received data back into the stack (somewhere here)?
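
For comparison, here is a back-of-the-envelope estimate of how long the
transfer alone should take. It assumes that 128 MB means 128 * 1024 * 1024
bytes and that the link delivers its nominal 1 Gb/s with no protocol
overhead, so it is a lower bound rather than a measurement.

```lua
-- Assumption: 128 MB is 128 MiB; the link runs at its nominal 1 Gb/s.
local payload_bytes = 128 * 1024 * 1024
local link_bits_per_s = 1e9
local wire_seconds = payload_bytes * 8 / link_bits_per_s

-- Average reported above for a single receive call.
local measured_seconds = 5.3

print(string.format("wire time: %.2f s", wire_seconds))                    -- wire time: 1.07 s
print(string.format("slowdown: %.1fx", measured_seconds / wire_seconds))   -- slowdown: 4.9x
```

So the call takes roughly five times longer than the network alone would
explain under these assumptions.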