On 9 April 2011 02:58, Robert G. Jakabosky <bobby@sharedrealm.com> wrote:
> On Friday 08, Matthew Wild wrote:
>> Ed sent this to me off-list, but suggested that he shouldn't have,
>> as my response was worth a wider audience.
>>
>> In fact I am also interested to know if anyone has experience with
>> the issues below, or with the various libraries. I did try LuaLanes a
>> couple of years back, but only got it to deadlock. Now that it is
>> somewhat better maintained, I should look into it again when I get
>> the chance.
>
> I have created a lightweight threads module for Lua [1].  Each thread gets
> its own lua_State instance and there is no locking/shared state.  To
> communicate between running threads you have to use a communications module
> (LuaSocket or the preferred lua-zmq [2]).
>
> My ZeroMQ bindings (lua-zmq [2]) have a wrapper object 'zmq.threads' [3] for
> the llthreads module, which makes it easy to start threads from a block of
> Lua code or a Lua file.
>
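
For anyone reading along who hasn't tried these modules: spawning a
thread from a string of Lua code with llthreads looks roughly like
this (a sketch written from memory of the README, so treat the exact
signatures as unverified):

    local llthreads = require"llthreads"

    -- Each child runs in its own lua_State; arguments are copied in at
    -- start, and (as I recall) return values are copied back via join().
    local thread = llthreads.new([[
        local name, count = ...
        for i = 1, count do
            print(name, "tick", i)
        end
        return "done"
    ]], "worker", 3)

    thread:start()
    print("child returned:", thread:join())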

Unfortunately ZeroMQ isn't currently recommended for use on the public
internet, which is why I have stayed away from it. We need to
implement clustering support in Prosody anyway (multiple machines
hosting the same logical service) and sometimes the nodes are going to
be distant. I'd rather not rely on firewall rules to keep Prosody safe
in such a case.

It doesn't seem to make sense to me to have multiple means for Prosody
processes to communicate.

> More comments below:
>

>> * One thread handling connections, and multiple worker threads for
>> processing messages from a queue (producer/consumer style):
>>  This approach may not bring much of a performance gain, as it may
>> end up with a number of global locks around things like storage and
>> network. Most code would be untouched, except for where locks are
>> required.
>
> Don't use shared state between the threads.  Keep the communication between
> threads async (i.e. use message passing).
>

Agreed. Thankfully this is the direction most of the available
threading libraries go in.
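
In lua-zmq terms that might look something like the sketch below: no
shared state, each thread with its own lua_State and its own context,
and a socket as the only channel between them (the endpoint and the
PUSH/PULL pairing are just illustrative):

    local llthreads = require"llthreads"
    local zmq = require"zmq"

    -- Parent thread: its own context and a PUSH socket; nothing is
    -- shared with the child except the connection.
    local ctx = zmq.init(1)
    local jobs = ctx:socket(zmq.PUSH)
    jobs:bind("tcp://127.0.0.1:5555")

    -- Child thread: a fresh lua_State and a fresh context of its own.
    local worker = llthreads.new([[
        local zmq = require"zmq"
        local ctx = zmq.init(1)
        local jobs = ctx:socket(zmq.PULL)
        jobs:connect("tcp://127.0.0.1:5555")
        print("worker got:", jobs:recv())
    ]])
    worker:start()

    jobs:send("hello from the main thread")
    worker:join()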

>> * Thread pool for specific blocking tasks, such as storage (few
>> truly async storage APIs are available)
>>   This is currently our favoured approach. We just scale out the few
>> parts we know can be bottlenecks; the majority of the work still
>> happens in one main thread, meaning no locking or other threading
>> issues can appear so easily.
>
> Creating a pool of worker threads with lua-zmq would be easy to do.  Also,
> ZeroMQ sockets can be added to the main event loop to get notified when
> work is finished.  See the 'handler.zmq' [8] object for how to embed a
> ZeroMQ socket into a lib-ev event loop.  It shouldn't be too hard to port
> that to the libevent interface.
>

Thanks, I wasn't aware this was possible.
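
If I've understood it, the trick is the socket's ZMQ_FD option, which
exposes a normal file descriptor you can hand to the loop, then drain
with non-blocking reads on each wakeup. A rough lua-ev sketch
(assuming lua-zmq exposes the option via s:getopt() and a zmq.NOBLOCK
receive flag):

    local zmq = require"zmq"
    local ev  = require"ev"

    local ctx     = zmq.init(1)
    local results = ctx:socket(zmq.PULL)
    results:bind("inproc://results")

    -- The fd from ZMQ_FD is edge-triggered: when it becomes readable we
    -- must drain the socket completely with non-blocking reads, or we
    -- may never be woken again for the messages left behind.
    local fd = results:getopt(zmq.FD)
    ev.IO.new(function()
        while true do
            local msg = results:recv(zmq.NOBLOCK)
            if not msg then break end   -- drained; wait for next wakeup
            print("worker finished:", msg)
        end
    end, fd, ev.READ):start(ev.Loop.default)

    ev.Loop.default:loop()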

>> * What I call the "magic" LuaProc/Erlang style, where we can just put
>> each session in a coroutine, and let the scheduler deal with the
>> multi-CPU scaling.
>>  This is attractive as it requires very little work on the threading
>> side; we can just focus on using coroutines and message passing. It
>> should also be possible to easily scale down to one thread.
>
> It would be interesting to see a distributed version of ConcurrentLua that
> used ZeroMQ for thread-to-thread & host-to-host message passing.
>

It would indeed. Let me know when it's released :)
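
For anyone who hasn't seen the style: stripped of the actual multi-CPU
scheduling, it amounts to a toy like the following, with plain
coroutines and a mailbox per session (all names invented):

    -- Toy cooperative scheduler: one coroutine per session, a mailbox
    -- per session, and a run queue. A real implementation would spread
    -- the run queue across OS threads for multi-CPU scaling.
    local sessions, runnable = {}, {}

    local function spawn(id, fn)
        local co = coroutine.create(fn)
        sessions[id] = { co = co, mailbox = {} }
        assert(coroutine.resume(co))    -- run until the first receive()
    end

    local function send(id, msg)
        table.insert(sessions[id].mailbox, msg)
        table.insert(runnable, id)
    end

    local function receive()
        return coroutine.yield()        -- park until a message arrives
    end

    local function run()
        while #runnable > 0 do
            local s = sessions[table.remove(runnable, 1)]
            local msg = table.remove(s.mailbox, 1)
            if msg ~= nil and coroutine.status(s.co) == "suspended" then
                assert(coroutine.resume(s.co, msg))
            end
        end
    end

    spawn("session1", function()
        while true do
            print("session1 got: " .. receive())
        end
    end)

    send("session1", "<message/>")
    run()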

>> And of course another significant reason is that CPU is rarely the
>> bottleneck for an XMPP server; it's typically RAM (when you have tens
>> of thousands of clients with various state, you do need a decent
>> server RAM-wise). So scaling across CPUs has yet to be a priority for
>> us.
>
> Where is all the RAM being used?  Have you profiled the memory usage?

Oh, necessary stuff like status messages and contact lists. We've
developed various tools for inspecting memory usage (we really should
release these; they've found reference leaks for us so many times...).
While there's always room for improvement, I don't think there's much
we can cut down on. It's still a fact that in typical XMPP server
usage, connections are open and idle more than they are
sending/receiving.

> I would
> think that the XML encoding/decoding would use a lot of CPU.
>

Actually, parsing is very fast; our CPU bottleneck under load is the
XML serialization. Since we parse every message into a mini DOM
structure, we have to serialize it again (usually with modifications)
on the way out. We've optimized the socks off the serialization code,
and I even rewrote it in C at one point (to little gain, so I didn't
commit it).
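
To give an idea of the work involved: every outgoing stanza means
walking the whole tree and escaping every attribute and text node
again. Schematically, a serializer is something like this (a
simplified sketch, not our actual code):

    local escapes = { ["<"] = "&lt;", [">"] = "&gt;", ["&"] = "&amp;",
                      ["'"] = "&apos;", ['"'] = "&quot;" }
    local function xml_escape(s) return (s:gsub("[<>&'\"]", escapes)) end

    -- Walk the tree, escaping everything and building a fresh string.
    local function serialize(node, out)
        out = out or {}
        out[#out+1] = "<" .. node.name
        for k, v in pairs(node.attr or {}) do
            out[#out+1] = (" %s='%s'"):format(k, xml_escape(v))
        end
        if #node == 0 then
            out[#out+1] = "/>"
        else
            out[#out+1] = ">"
            for _, child in ipairs(node) do
                if type(child) == "string" then
                    out[#out+1] = xml_escape(child)
                else
                    serialize(child, out)
                end
            end
            out[#out+1] = "</" .. node.name .. ">"
        end
        return table.concat(out)
    end

    print(serialize{ name = "message", attr = { to = "juliet@example.com" },
                     { name = "body", "Hello" } })
    -- <message to='juliet@example.com'><body>Hello</body></message>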

I think we're at the stage where the best way to optimize is by
cutting down the amount of serialization and interning of network data
we do, e.g. by storing incoming data and parsing it in-place, and
using the source buffer to write out the other side if we made no
modifications. daurnimator and I had a bit of a hack session chasing
this idea, and the results were interesting but still need some work
to get what we're after. I'm considering taking advantage of the FFI
now to avoid interning and re-serializing stuff we don't need to.
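
In miniature, the idea is something like this (entirely hypothetical
names; the forward() helper, conn:write() and the raw_start/raw_end
fields are invented for illustration):

    -- Remember the byte range each stanza occupied in the input buffer,
    -- and only fall back to full serialization if something changed it.
    local function forward(stanza, source_buffer, conn)
        if not stanza.modified and stanza.raw_start then
            -- Untouched: write the original bytes straight back out.
            conn:write(source_buffer:sub(stanza.raw_start, stanza.raw_end))
        else
            -- Modified: re-serialize the tree as we do today.
            conn:write(serialize(stanza))
        end
    end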

>> More of a priority is scaling across multiple machines - i.e. having
>> multiple Prosody instances serving the same domain (for load-balancing
>> and reliability). This would also allow multiple Prosodies to run on
>> one machine and share the CPUs.
>
> ZeroMQ would allow you to scale from multiple threads to many servers.  I
> would recommend diagramming the flow of messages from client connections to
> different parts of the 'backend' services (i.e. storage, database, message
> routing services).  If you separate those shared components into units that
> communicate with messages, then you can either keep them local or move them to
> different threads or servers.  Also if you use the design patterns from the
> ZeroMQ guide [6] you can scale each backend service from one instance to
> multiple instances as needed, without changing the code.
>
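
The "without changing the code" property comes from connect-side
fan-out: workers connect to a stable endpoint and ZeroMQ load-balances
among however many happen to be running. Schematically (endpoint
invented):

    local zmq = require"zmq"
    local ctx = zmq.init(1)

    -- Frontend: binds once and fair-queues jobs over however many
    -- workers are connected; adding workers changes nothing here.
    local jobs = ctx:socket(zmq.PUSH)
    jobs:bind("tcp://*:5557")
    jobs:send("store:juliet@example.com")

    -- Worker (run one locally today, ten across several hosts tomorrow,
    -- with identical code):
    --   local jobs = ctx:socket(zmq.PULL)
    --   jobs:connect("tcp://somehost:5557")
    --   while true do
    --       local job = jobs:recv()
    --       -- ... do the work ...
    --   end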

Interestingly, this is how the earliest XMPP daemons were constructed,
except using XMPP itself between the components. It's certainly an
option worth considering; I only worry about how much code (and
complexity) it'll add for something the majority of deployments don't
need.

>> We don't use a coroutine per connection like copas does; we use
>> callbacks instead. Some rough benchmarks when we first started hinted
>> that we could save quite a bit on resources by not using coroutines.
>
> I agree about the resource overhead of coroutines.
>
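
Schematically, the two styles being compared, with invented
parse()/route()/receive() helpers; the coroutine version pays for a
Lua stack per connection:

    -- Callback style: per connection we keep only a handler closure,
    -- which the event loop calls when data arrives.
    local function on_incoming_data(conn, data)
        route(parse(data))
    end

    -- Coroutine style (copas-like): one coroutine per connection, each
    -- carrying its own stack, which is the per-connection overhead.
    local function connection_main(conn)
        while true do
            route(parse(conn:receive())) -- receive() yields until data
        end
    end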

Good; I don't have my original benchmarks anymore, and it's reassuring
to know I haven't spent the past three years living a lie :)

Thanks,
Matthew