- Subject: Re: on the cost of non-blocking LuaSockets
- From: William Ahern <william@...>
- Date: Mon, 5 Jan 2015 14:10:26 -0800
On Mon, Jan 05, 2015 at 05:19:47PM +0100, Valerio Schiavoni wrote:
> On Fri, Dec 19, 2014 at 10:04 PM, William Ahern <email@example.com> wrote:
> > I suspect there's a bug or performance issue specific to your code or to
> > LuaSocket. To me it sounds like you have (or LuaSocket has) set the socket
> > to a 0-second timeout, but didn't restore the timeout to non-0 before
> > polling. If so, the socket would continually signal read readiness even
> > though there's nothing to read. Read readiness is what you'd see when the
> > kernel socket timeout expires. That would convert your poll loop into a
> > busy
> > loop and it definitely could lower throughput. That's my hunch, but I've
> > never used LuaSocket before, and only have passing familiarity with the
> > code.
> You raise an interesting point here. In my case, once a socket is created,
> its timeout is set to 0 seconds for its whole lifetime, and never set back
> to non-0. Is that wrong?
Yes. Because when you poll on the socket it will always poll as ready for
reading, regardless of whether there's any data to read. That means every
call to select, poll, or epoll will immediately return; it's not actually
waiting for anything, which turns your event loop into a busy loop.
> When (that is, at which moment in the life of a socket) is it correct to
> set a 0-timeout?
I've never used luasocket, but AFAIU older versions (pre 2.0?) didn't
support O_NONBLOCK. The workaround was to set SO_RCVTIMEO and SO_SNDTIMEO to
0 before attempting the read or write, respectively. But you were supposed
to reset those values to non-0 before polling.
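The modern alternative to that SO_RCVTIMEO/SO_SNDTIMEO juggling is to put
the descriptor in O_NONBLOCK mode and let poll(2) do the waiting. A minimal
sketch (the helper names here are illustrative, not from LuaSocket or
cqueues):

```c
/* Sketch: non-blocking I/O via O_NONBLOCK + poll(2), instead of
 * toggling SO_RCVTIMEO/SO_SNDTIMEO around each call. Helper names
 * are hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

static int set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Wait for readability, then read. poll() is what blocks here, so the
 * loop around this call does not spin even though the read itself
 * returns immediately. Returns bytes read, 0 on poll timeout, -1 on
 * error. */
static ssize_t read_ready(int fd, void *buf, size_t len, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;      /* poll error */
    if (n == 0)
        return 0;       /* timed out; nothing readable */
    return read(fd, buf, len);  /* may still fail with EAGAIN */
}
```

The key property is that the blocking lives in poll(), which only reports
the socket ready when there is actually something to read, so you never
spin.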
> In the context of a non-blocking event-based runtime, it did not sound
> horribly wrong to set those sockets to a 0-timeout once and forever.
It is horribly wrong :(
> > Plus, you typically don't stop reading on a socket until you've drained the
> > input;
> Here you probably refer to the time required to drain the input read
> buffer, right? That one is small.
I think the context was about the cost+benefit of using persistent events
with epoll and kqueue. You can just ignore it for the purposes of resolving
your immediate issue.
> > In a recent benchmark of mine I generated 5,000 individual compressed
> > real-time audio streams (i.e. each stream is a unique transcoding from a
> > source feed, with individualized 30-second ad spots inserted every
> > 30 seconds) and sent them across a LAN. That saturates my 10Gb/s test
> > network, and I never even bothered with low-level performance
> > optimizations.
> > The core loop assembling the 5,000 personalized streams and writing them to
> > 5,000 sockets is actually single threaded.
> Is this test using luasockets? If so, I'd be very interested in looking
> at the source code.
No, I don't use luasockets. I usually use my own project, cqueues, at
http://www.25thandclement.com/~william/projects/cqueues.html. In addition to
other features, cqueues makes it easier for me to write hybrid C and Lua
servers, whether the core loop is driven from C or Lua. But that's probably
not a requirement for you if you're using luasockets.
However, the above benchmark was mostly in C. In my media streaming server
the core streaming engine--and the benchmark utility--is all in C. I use
Lua+cqueues as the policy engine to direct connections (audio stream,
metadata request, etc.), select advertisements, track incoming broadcast
stream state, and so on. The policy engine runs in its own process and I
communicate with it using IPC (Unix domain sockets). I haven't benchmarked
the policy engine directly because it's never been a bottleneck. During
benchmarking I have to throttle the rate of incoming connections, otherwise
there's too much packet loss on the network. So I haven't had occasion to
bother benchmarking the Lua components directly. That's been the pattern
with most of what I do: the core application I/O code is never the
bottleneck; the network or the data processing usually is.
In production, at least. I usually run single-process, single-threaded
during development because it makes debugging easier.

OTOH I always use cqueues or one of several similar event systems I've
written, and I'm not shy about farming out components to C, or wrapping an
existing C library into a Lua module. Which makes it all the more important
that my Lua-based event framework work well with C modules, just like Lua
excels at mixing C and script code. And it's important for my C modules to
be self-contained and not have built-in dependencies on any single
framework. For example, I've written a non-blocking MySQL client library in
C. It also comes with Lua bindings and wrappers to make it look like a
LuaSQL driver. And I use it in C with libevent, in Lua/C with a framework
built on libevent, in Lua/C with a predecessor to cqueues, and in Lua/C
with cqueues itself.
> Did you do a similar test where a machine receives those 5,000 streams
> over 5,000 server sockets? I'd be curious to know what throughput you
> reached.
That was the test: one socket per listening stream, and each listening
stream is unique--the same broadcast audio but interspersed every 30 seconds
with a per-listener ad spot, so the loop isn't simply directly copying data
to the output buffer of 5,000 sockets. There's also a small number of
sockets for IPC and various other tasks on the server, but basically it's
O(N) in the number of sockets. So there were over 5,000 sockets on the
server, and precisely 5,000 sockets in the client benchmark utility. But
the server and client were on separate hardware connected over a LAN.
Ultimately it was the Linux kernel and the network that had trouble keeping
up with the amount of data, largely because of so many small writes. It's a
real-time streaming server which tries its best to minimize latency (mostly
for demo purposes, so it sounds more responsive when comparing the
over-the-air broadcast with the transcoded one), so except for the initial
connection it doesn't buffer more than a single compressed frame of data
before writing it out to the socket. That's roughly 1 packet per second per
socket, depending on the codec. At scale that's very taxing on the kernel
TCP stack and the network because of all the ACKs.
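For that latency-over-throughput trade-off, the standard knob is
TCP_NODELAY, which disables Nagle's algorithm so each small frame goes out
immediately instead of being coalesced. Whether the server above actually
sets it isn't stated in the thread; this is just a sketch of the usual
approach:

```c
/* Sketch: disable Nagle's algorithm for latency-sensitive small writes,
 * the usual setting for one-frame-at-a-time streaming. Each small write
 * is sent immediately; the cost is more packets (and ACKs) on the wire. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int set_low_latency(int fd) {
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```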