Re: on the cost of non-blocking LuaSockets

Hello William,

On Fri, Dec 19, 2014 at 10:04 PM, William Ahern <william@25thandclement.com> wrote:

> However, I still have the doubt that there is an inherent performance issue
> when using non-blocking sockets attached to coroutines that need to read
> big chunks of data (hundreds of MB).
> Do I understand correctly: the fact that the coroutines need to
> periodically yield/resume implies that the achievable download throughput is
> necessarily reduced?
> This seems like a big issue to me...

I suspect there's a bug or performance issue specific to your code or to
LuaSocket. To me it sounds like you have (or LuaSocket has) set the socket
to a 0-second timeout, but didn't restore the timeout to non-0 before
polling. If so, the socket would continually signal read readiness even
though there's nothing to read. Read readiness is what you'd see when the
kernel socket timeout expires. That would convert your poll loop into a busy
loop and it definitely could lower throughput. That's my hunch, but I've
never used LuaSocket before, and only have passing familiarity with the
code.

You raise an interesting point here. In my case, once a socket is created, its timeout is set to 0 seconds for its whole lifetime and is never set back to a non-zero value. Is that wrong?

When (that is, at which moment in the life of a socket) is it correct to set a 0 timeout?

In the context of a non-blocking event-based runtime, it did not sound horribly wrong to set those sockets to a 0-timeout once and forever.
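To make the distinction concrete, here is a minimal sketch in Python rather than LuaSocket (the helper name recv_when_ready is hypothetical, and nothing here claims to mirror LuaSocket internals). It assumes the socket itself stays non-blocking for its whole lifetime while the readiness wait, not the socket, carries the non-zero timeout:

```python
import select
import socket

def recv_when_ready(sock, wait_seconds, bufsize=4096):
    """Block in select() for up to wait_seconds, then read once.

    Returns the bytes read, or None if nothing became readable in time.
    """
    readable, _, _ = select.select([sock], [], [], wait_seconds)
    if not readable:
        return None
    return sock.recv(bufsize)

# A connected pair of local sockets stands in for a real connection.
a, b = socket.socketpair()
a.setblocking(False)  # same effect as a zero timeout on the socket itself

# Reading with nothing pending fails immediately instead of blocking.
try:
    a.recv(4096)
    got_would_block = False
except BlockingIOError:
    got_would_block = True

# The wait happens in select(), so the loop sleeps instead of spinning.
nothing = recv_when_ready(a, 0.05)

b.send(b"hello")
data = recv_when_ready(a, 1.0)

a.close()
b.close()
```

With this arrangement the zero timeout on the socket never has to be restored; what matters is that the call waiting for readiness is allowed to block.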

Plus, you typically don't stop reading on a socket until you've drained the
input;

Here you are probably referring to the time required to drain the input read buffer, right? That time is small.

or don't stop writing until you've filled the buffer. If you're
pushing many megabytes or gigabytes of data, that will further reduce the
cost (if it exists in your scenario) of non-blocking I/O. However, if you
perform only a single read or write and without waiting for EAGAIN before
polling, that might be problematic.
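The "read until EAGAIN before polling again" pattern described above can be sketched as follows. This is an illustration in Python, not LuaSocket code; the function name drain and the buffer sizes are assumptions chosen for the example:

```python
import socket

def drain(sock, bufsize=65536):
    """Read from a non-blocking socket until it would block or hits EOF.

    Returns (data, eof). Only after this returns with eof == False is it
    worth going back to select/poll for this socket.
    """
    chunks = []
    eof = False
    while True:
        try:
            chunk = sock.recv(bufsize)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: buffer is empty
            break
        if not chunk:  # empty read means the peer closed the connection
            eof = True
            break
        chunks.append(chunk)
    return b"".join(chunks), eof

# A connected pair of local sockets stands in for a real connection.
reader, writer = socket.socketpair()
reader.setblocking(False)

for _ in range(3):
    writer.sendall(b"x" * 1000)

# A deliberately small bufsize forces several recv() calls per wakeup.
data, eof = drain(reader, bufsize=1024)

writer.close()
tail, eof_after_close = drain(reader)
reader.close()
```

One readiness notification is thus amortized over as many reads as the kernel buffer can satisfy, which is why the per-wakeup overhead shrinks as the transfer size grows.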

In a recent benchmark of mine I generated 5,000 individual compressed
real-time audio streams (i.e. each stream is a unique transcoding from a
source feed, with individualized 30-second ad spots inserted every
30 seconds) and sent them across a LAN. That saturates my 10Gb/s test
network, and I never even bothered with low-level performance optimizations.
The core loop assembling the 5,000 personalized streams and writing them to
5,000 sockets is actually single threaded.

Is this test using LuaSocket? If so, I'd be very interested in looking at the source code.

Did you do a similar test where a machine receives those 5,000 streams over 5,000 server sockets? I'd be curious to know what throughput you reached.

At peak CPU (during the start of each 30-second ad insertion) the biggest
CPU hog is the kernel handling the NIC I/O. And the NIC is being serviced by
the same CPU (again, never bothered with performance optimizations, such as
IRQ CPU pinning), so even though this is a 4-core Xeon E3, it's all
effectively running on a single core.

The only other performance problem I've ever encountered with non-blocking
sockets is when using the select or poll syscalls. When daemons start
polling on 4k, 5k, or 8k+ sockets, the userspace-to-kernel copy and the
kernel scan of the pollset or fd_set begins costing significant CPU time.
Because CPUs are so fast and networks so slow, in many (perhaps most)
applications the return set usually only includes a few descriptors at most,
even if they're all seemingly active from a human's perspective. Even if
you're processing several kilobytes or even megabytes of data per socket per
second, in the universe of the tiny gremlins slaving away in your CPU, each
socket is dormant the vast majority of the time.

For 8k sockets, that means for every unique event the poll syscall itself is
doing tens or perhaps hundreds of kilobytes of memory I/O (reading,
scanning, checking). Imagine all 8k sockets signal read readiness within the
span of 1s, but every poll syscall only returns 1 descriptor as ready. (This
is possible!) 8k calls result in upwards of 500MB of data that is generated
in 1s. Just for the polling! (8K * 8K * sizeof (struct pollfd)). And that
translates into gigabytes of memory I/O in the kernel, because the kernel is
scanning 8k data structures internally, too.
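As a rough check of the 500MB figure, the arithmetic can be reproduced in a few lines of Python. The sketch assumes the usual layout of struct pollfd (an int followed by two shorts, 8 bytes on common platforms) and takes "8K" to mean 8 * 1024; both are assumptions, not something stated above:

```python
import struct

# struct pollfd { int fd; short events; short revents; }
# Native alignment gives 8 bytes on common platforms (assumption).
POLLFD_SIZE = struct.calcsize("ihh")

sockets = 8 * 1024           # descriptors in the pollset
calls_per_second = 8 * 1024  # worst case: one ready descriptor per call

# Bytes of pollfd array handed to the kernel across all calls in 1s.
bytes_copied = sockets * calls_per_second * POLLFD_SIZE

print(f"{bytes_copied / 2**20:.0f} MiB copied per second for polling alone")
```

Under those assumptions the product is 512 MiB per second, consistent with "upwards of 500MB", and that is before counting the kernel's own scan of its internal structures.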

I'm not facing this kind of load, though these are very interesting numbers, thanks for sharing.