- Subject: Re: on the cost of non-blocking LuaSockets
- From: William Ahern <william@...>
- Date: Fri, 19 Dec 2014 13:04:01 -0800
On Thu, Dec 18, 2014 at 11:50:43AM +0100, Valerio Schiavoni wrote:
> Thanks to everyone for the interesting insights on the thundering herd
> issue, I surely learned more than expected from this discussion.
>
> However, I still suspect that there is an inherent performance issue
> when using non-blocking sockets attached to co-routines that need to read
> big chunks of data (hundreds of MB).
> Do I understand correctly that the fact that the coroutines need to
> periodically yield/resume implies that the achievable download throughput
> is necessarily reduced?
> This seems like a big issue to me...
I suspect there's a bug or performance issue specific to your code or to
LuaSocket. It sounds like you (or LuaSocket) have set the socket to a
0-second timeout but never restored it to a non-zero value before polling.
If so, the socket would continually signal read readiness even though
there's nothing to read; read readiness is what you'd see when the kernel
socket timeout expires. That would convert your poll loop into a busy loop,
and it definitely could lower throughput. That's my hunch, but I've never
used LuaSocket before and have only a passing familiarity with the code.
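For comparison, here's a rough sketch of the usual non-blocking LuaSocket
read loop (based on the documented API; the peer address and the processing
are placeholders). The detail worth checking in your code is the timeout
that ends up being passed to select:

-- Minimal non-blocking read loop with LuaSocket; the peer is a placeholder.
local socket = require"socket"

local sock = assert(socket.connect("127.0.0.1", 8000))
sock:settimeout(0)  -- receive() now returns immediately with err == "timeout"

while true do
  local data, err, partial = sock:receive(2^14)
  local chunk = data or partial
  if chunk and #chunk > 0 then
    io.write(chunk)  -- stand-in for whatever processing you do
  end
  if err == "closed" then break end
  if err == "timeout" then
    -- The third argument is the select timeout: nil (or a positive number)
    -- sleeps until readiness; 0 makes select return immediately and turns
    -- this loop into a busy loop.
    socket.select({sock}, nil, nil)
  end
end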
It's true that doing I/O on a non-blocking socket will be slightly less
performant (higher latency, higher CPU utilization) than doing the same I/O
on a blocking socket, all else being equal--single threaded, single
socket. Imagine two processes sending messages over a pipe. If they're using
blocking sockets, when process B writes some data the kernel can immediately
wake up process A--in fact, it might even be able to avoid copying the data
to the pipe buffer altogether and instead copy it straight to process A. So
at a minimum you're doing fewer context switches. This is but one of many
optimizations that can happen when using blocking I/O, and it's why
message-passing kernels don't have to be slow.
However, except on embedded processors this normally wouldn't translate into
perceptibly reduced throughput. That's because the CPU is not the bottleneck:
the network is, then the memory system, and only lastly the CPU. The
difference in CPU cost is negligible on high-end CPUs. And when you start
scaling to thousands of sockets, you use less CPU doing non-blocking I/O
than blocking I/O. At the scale of thousands of sockets, using 1:1 threads
per socket wastes time on superfluous scheduling and context switching.
Plus, you typically don't stop reading from a socket until you've drained
the input, and don't stop writing until you've filled the buffer. If you're
pushing many megabytes or gigabytes of data, that further reduces whatever
cost non-blocking I/O carries in your scenario. However, if you perform only
a single read or write per wakeup, without draining the socket (i.e. waiting
for EAGAIN) before polling again, that might be problematic--see the sketch
below.
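As a rough sketch of that drain pattern (wait_readable and sink are
hypothetical stand-ins for whatever your scheduler and application provide;
they are not LuaSocket functions):

-- Drain the socket until LuaSocket reports "timeout" (i.e. EAGAIN) before
-- handing control back to the poller. wait_readable would typically
-- coroutine.yield into a select/epoll loop; sink consumes the data.
local function reader(sock, wait_readable, sink)
  sock:settimeout(0)
  while true do
    repeat
      local data, err, partial = sock:receive(2^16)
      local chunk = data or partial
      if chunk and #chunk > 0 then sink(chunk) end
      if err and err ~= "timeout" then return end  -- "closed" or a hard error
    until err == "timeout"
    -- Only now pay for another poll cycle.
    wait_readable(sock)
  end
end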
In a recent benchmark of mine I generated 5,000 individual compressed
real-time audio streams (i.e. each stream is a unique transcoding from a
source feed, with individualized 30-second ad spots inserted every 30
seconds) and sent them across a LAN. That saturated my 10Gb/s test network,
and I never even bothered with low-level performance optimizations. The core
loop assembling the 5,000 personalized streams and writing them to 5,000
sockets is actually single threaded.
At peak CPU (during the start of each 30-second ad insertion) the biggest
CPU hog is the kernel handling the NIC I/O. And the NIC is being serviced by
the same CPU (again, I never bothered with performance optimizations, such
as IRQ CPU pinning), so even though this is a 4-core Xeon E3, it's all
effectively running on a single core.
The only other performance problem I've ever encountered with non-blocking
sockets is when using the select or poll syscalls. When daemons start
polling on 4k, 5k, or 8k+ sockets, the userspace-to-kernel copy and the
kernel's scan of the pollset or fd_set begin costing significant CPU time.
Because CPUs are so fast and networks so slow, in many (perhaps most)
applications the return set usually only includes a few descriptors at most,
even if they're all seemingly active from a human's perspective. Even if
you're processing several kilobytes or even megabytes of data per socket per
second, in the universe of the tiny gremlins slaving away in your CPU, each
socket is dormant the vast majority of the time.
For 8k sockets, that means for every unique event the poll syscall itself is
doing tens or perhaps hundreds of kilobytes of memory I/O (reading,
scanning, checking). Imagine all 8k sockets signal read readiness within the
span of one second, but every poll syscall returns only 1 descriptor as
ready. (This is possible!) Those 8k calls copy upwards of 500MB of data in
that one second--just for the polling! (8K calls * 8K entries * sizeof
(struct pollfd)). And that translates into gigabytes of memory I/O in the
kernel, because the kernel is scanning 8k data structures internally, too.
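To make that arithmetic concrete (assuming an 8-byte struct pollfd, as on
Linux x86-64):

-- Back-of-the-envelope cost of the 8K-socket poll(2) scenario above.
local nfds          = 8192  -- descriptors passed to every poll() call
local calls_per_sec = 8192  -- one call per ready descriptor, all within 1s
local sizeof_pollfd = 8     -- int fd + short events + short revents

local bytes = nfds * calls_per_sec * sizeof_pollfd
print(("%d MiB copied userspace->kernel, per second"):format(bytes / 2^20))
--> 512 MiB copied userspace->kernel, per second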
This is why we use epoll and kqueue.
However, I've seen systems that misuse epoll and kqueue. I recently reviewed
the Mongrel webserver--which many consider an exemplar of scalable
polling--for a company. The Mongrel author was convinced (and had
benchmarks seemingly proving it) that using poll/select was faster under
some scenarios. However, Mongrel's use of epoll and kqueue was suboptimal as
a result of the way Mongrel implemented non-blocking I/O atop a C-based
coroutine/fiber library, preventing it from fully leveraging event
persistence. The benefits of epoll and kqueue melt away if you're needlessly
adding and deleting events on each descriptor, which is what Mongrel does.
During peak activity the computational complexity of Mongrel's use of
epoll/kqueue degenerates to that of poll/select.
IIRC, every time Mongrel resumes a coroutine it deletes from epoll/kqueue
the socket descriptor that put the coroutine to sleep. That effectively
results in two extra operations (a delete on resume, a re-add on the next
yield) per yield-resume cycle per coroutine--which is precisely the churn
you're trying to avoid by using epoll/kqueue. He had to do this because he
didn't know whether, when the coroutine resumed, the application code would
close the socket. If it closed the socket but didn't tell the scheduler, the
scheduler's descriptor and event state would become stale and corrupted.
Because Mongrel cannot detect when a socket descriptor is closed, and
doesn't require the application code running atop it to send a
notification, it simply removes the descriptor from the polling set before
resuming a coroutine. This is why poll/select seemed faster for active
descriptors in the author's benchmarks. But it shouldn't be surprising that
emulating the Big-O complexity of poll/select through the contrivance of
epoll is going to be slower.
There are two ways around this problem, but they're too involved to discuss
here. Suffice it to say that the approach libevent and nginx take (and by
extension any application that employs the I/O layer of libevent or nginx)
is _not_ the one I prefer, as it impedes integration and composition of
libraries and modules. This is why I wrote my own little event loop,
cqueues, for Lua.
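To give a flavor of it, a cqueues client looks roughly like this (a sketch
from memory; see the cqueues documentation for the exact API and error
handling; the peer is a placeholder):

local cqueues = require"cqueues"
local socket  = require"cqueues.socket"

local cq = cqueues.new()

-- Each wrapped function runs as a coroutine; it yields to the controller
-- whenever its socket would block, and is resumed when I/O is ready.
cq:wrap(function()
  local sock = socket.connect("example.net", 80)  -- placeholder peer
  sock:write("GET / HTTP/1.0\r\nHost: example.net\r\n\r\n")
  for line in sock:lines() do
    print(line)
  end
  sock:close()
end)

assert(cq:loop())  -- run the event loop until all coroutines finish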