lua-users home
lua-l archive



On Fri, Aug 22, 2014 at 11:58:03PM -0700, Coroutines wrote:
<snip>
> The cost of serializing an object and sharing it through some method
> of IPC is more than it needs to be.  I take issue with the concept of
> it.  When I have brought this up before I'm told to look elsewhere for
> inefficiency, but I am someone who likes to solve benign problems.
> Also lock-free algorithms are a thing.  Message-passing will never be
> as quick as zero-copied, shared data ;>

Lock-free doesn't mean wait-free. This is an important difference because if
it's not wait-free then it can seriously complicate your entire application.
Now you have to carefully juggle multiple tasks in an intricately dependent
manner. It makes abstraction very difficult.

Also, lock-free algorithms rely on atomic instructions and memory barriers,
which are cache killers; that's the very reason traditional mutexes, even in
the uncontended case, are undesirable.

So there are many instances where serialization makes sense and is much more
performant. In fact, it'll almost always be faster for short messages that
fit in cache, and it can be made to scale just as well as transactional
algorithms because by its nature you have minimal data dependencies. (Note
that message passing does not necessarily exclude the use of shared memory.)

In any event it's nearly impossible to make use of transactional algorithms
from a language like Lua. There are no commodity chips currently in
existence that support primitives like strong LL/SC or even something as
simple as DCAS (not double-width CAS, but CAS of two non-contiguous
addresses). If processors had strong LL/SC then we'd be living in a
concurrency utopia. But lock-free and/or wait-free algorithms on x86 are
stuck with CAS or double-width CAS, and so are either very simplistic or,
like RCU, require very careful attention to low-level flow control. In
neither case do they lend themselves to the kind of abstraction that would
make them easy to expose in a generic manner within a scripting language.

Most "software transactional" projects actually use regular mutexes under
the hood. See, for example, PyPy, or any of the various C libraries. For
something like PyPy you also need a copying garbage collector.

Even Intel's new TSX extensions aren't really transactional. They guarantee
neither lock-free nor wait-free behavior, nor are they sufficiently strong
(like strong LL/SC) to be able to implement lock-free, wait-free behavior
using software-based algorithms. They're simply optimizations embedded in
the CPU's cache coherency logic, and it turns out they were broken anyhow.
(See recent Intel errata announcement.)

Most of the lock-free algorithm stuff is just hype. I got really excited
about it 10 years ago. I spent hundreds of dollars on ACM Queue and IEEE
papers to learn all the algorithms, most of which were discovered in the
1980s and early 1990s. Then I learned that almost all modern hardware is
entirely inadequate to implement most of the well-known transactional
algorithms. And the few you can implement are seriously hobbled, which makes
them hard to build abstractions atop. Sure, you can create a lock-free
singly-linked list. But how do you integrate that into Lua? You can't even
implement a lock-free doubly-linked list that supports arbitrary element
insertion; with CAS the best you can do is support insertions/deletions at
the tails. If you look at "lock-free" trees, etc, they have very narrowly
tailored usage restrictions. Or RCU. It's insanely difficult to use these
things because their usage restrictions bleed into almost every aspect of
your code, much more so than even message passing.

But what about the famous Java Azul platform with its 100% concurrent
memory collector? Turns out those were custom-made chips that provided
LL/SC. But their custom CPUs were so slow that they got better performance
(with similar latency) by simply writing a low-level virtual machine on x86
which implemented their cache coherency model. And that sat _below_ their
JVM. And in any event Azul is way slower than regular, mutex-based JVMs,
even when you scale up. Azul provides low latency but crappy throughput,
which is why neither Sun, Oracle, nor IBM was interested in acquiring them.

So, yeah, tons of hype all around. Depressing, but it's the reality. These
algorithms can be very useful and very performant, but not as uniformly as
message passing.