On Thu, Jan 12, 2017 at 12:46:43PM -0800, Coda Highland wrote:
> On Thu, Jan 12, 2017 at 10:33 AM, William Ahern
> <william@25thandclement.com> wrote:
> > On Thu, Jan 12, 2017 at 07:42:49AM -0800, Coda Highland wrote:
<snip>
> >> Python and C/C++ (and maybe Obj-C?) have stackful coroutines.
> >
> > Python doesn't have stackful coroutines. Quite the opposite. Same for C++. A
> > stackful coroutine means you can yield from nested function invocations
> > _without_ having to specifically invoke the nested function as a coroutine.
> >
> > As for C and C++, multiple stacks don't really count, otherwise you could
> > argue that any language with threads has coroutines. But coroutines aren't
> > just about maintaining state of function invocations, but about how control
> > flow is passed.
> 
> Okay, so maybe I'm wrong about Python, but I'm not wrong about the C
> family. You're right that multiple stacks on its own aren't equivalent
> to coroutines, but they provide the essential structure necessary to
> implement them. You can't just look at the core language definition
> and stop there. I've used a coroutine library before. It's really not
> bad at all.
> 
> Also, ES7 has them.

AFAIU, as in Python, they're not stackful coroutines. The problem in both
Python and JavaScript is that the major implementations rely too heavily on
the C stack. Frames of interpreted and JIT-compiled code are intermixed with
frames of C and assembly code, and oftentimes the state of in-language
objects is kept on the C stack.

Neither wants to play games juggling fixed-size C stacks, which is what
LuaJIT 1.x did (via Coco); nor to copy state dynamically across stacks, as
Go does, which requires an incredible amount of complex instrumentation and
carefully orchestrated run-time code to correctly locate and move variables
(see the Go 1.7 stack resize bug!); nor to maintain a more complex stack
data structure, as Lua and LuaJIT 2.x do, which they see as too costly.

Given these self-imposed limitations, coroutines in those languages (and in
languages with similar constructs, like async/await in C#) are limited to a
single function invocation. You can invoke coroutines recursively, but each
new invocation requires a new heap allocation for the frame state, and you
have to decorate either the call site or the function definition so the
compiler understands the call semantics. So you have to decorate each call
of a resumable function with "yield" or "await", or qualify the definition
with "async" or "function*". There are solutions in JavaScript and Python
which can hide this chaining, but they merely transform the code into the
equivalent decorated constructs. There's no avoiding how the implementation
works under the hood.
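
For contrast, here's a minimal Lua sketch of what "stackful" buys you (the
function names are just illustrative): a leaf function can yield from
arbitrarily deep inside ordinary, undecorated calls, and the whole call
chain is suspended and resumed as a unit.

  local function leaf(x)
    -- a plain function; nothing marks it as resumable
    coroutine.yield(x * 2)
    return x
  end

  local function middle(x)
    -- an ordinary call site; no "await"/"yield" decoration needed
    return leaf(x + 1)
  end

  local co = coroutine.create(function (x) return middle(x) end)
  print(coroutine.resume(co, 10))  --> true  22  (yielded two frames down)
  print(coroutine.resume(co))      --> true  11  (the suspended chain resumes)

In the single-frame designs above, both middle and leaf would have to be
qualified with async/function*, every call site decorated with await/yield,
and each invocation would allocate its own frame object.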

It's IMO a half-solution. As with promises, it performs poorly if heavily
used because you're constantly creating new dynamic objects just to invoke
a function.[1] It breaks function composition because in order for a leaf
invocation to yield, every invocation between the leaf and the resume point
has to be explicitly decorated. This means you can't use coroutines as
liberally and transparently as you do in Lua, even when they're the ideal
abstraction (e.g. wrapping a callback interface as an iterator, as sketched
below). And in general it's just a very leaky abstraction that obfuscates
what is often an inherently sequential flow of control, and which often
should remain contextually sequential. For example, the asynchronous I/O
operations of a single network client are mostly logically sequential, and
they should usually remain sequential to preserve back-pressure (and thus
overall throughput) under highly concurrent load. In other words, a client
request is often implemented as a series of sequential I/O operations over
one or more data sources, and trying to optimize the single-request case
(e.g. by doing multiple database queries concurrently to reduce latency)
can end up diminishing overall throughput.
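
To make the iterator example concrete, here's a sketch (walk stands in for
any callback-style API): coroutine.wrap turns a callback-driven traversal
into a plain for-loop iterator, with the yield buried inside the callback
rather than decorating every level in between.

  -- a callback-style API: invokes `visit` once per item
  local function walk(t, visit)
    for i = 1, #t do
      visit(t[i])
    end
  end

  -- adapt it into an iterator: the yield happens inside the callback,
  -- several frames below the coroutine entry point
  local function values(t)
    return coroutine.wrap(function ()
      walk(t, coroutine.yield)
    end)
  end

  for v in values{ "a", "b", "c" } do print(v) end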

Lua's stackful, asymmetric coroutines are IMHO particularly elegant. For
the cost of not generally being able to use the C stack for state (at least
not in a simple manner), the coroutine is kept lightweight even though
stackful. The setup and initial space cost is about the same as a closure's,
but unlike with promises or chained non-stackful coroutines, it's a one-time
cost, amortized over the entire logical task. And it imposes no
inter-procedural costs (as trampolines or chains do) because, while it
doesn't use the C stack, it can still use a linear stack for function
invocations rather than the effectively linked-list structure that chained,
CPS-style frames require.
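
As a rough illustration of that one-time cost (a sketch, not a benchmark):
a coroutine is created once, like a closure, and then resumed arbitrarily
many times, each resume/yield being a cheap control transfer rather than a
fresh allocation.

  -- one allocation up front, comparable to creating a closure ...
  local task = coroutine.create(function (n)
    local total = 0
    while true do
      total = total + n
      -- ... after which each step is just a control transfer
      n = coroutine.yield(total)
    end
  end)

  for i = 1, 5 do
    print(select(2, coroutine.resume(task, i)))  --> 1, 3, 6, 10, 15
  end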

I've never used Go, but I really _love_ the idea of Go precisely because
they really understand the value (in terms of elegance and performance) of
well-implemented closures and coroutines, even though implementing them well
is very difficult. Go came out about the same time I began initial work on my
cqueues Lua module for concurrent I/O. And my approach to writing highly
concurrent network services--the details and API, not the whole
green-threading thing which was old-hat by then--was partly informed by my
study of Go's predecessor, Limbo.[2] These papers are very instructive
regarding the costs and benefits of how coroutines and closures are
implemented in Lua, are good introductions to how Go implements them
(details significantly different, of course), and help to explain why other
language implementations would have so much difficulty implementing one or
the other.
  
  http://www.cs.tufts.edu/~nr/cs257/archive/roberto-ierusalimschy/closures-draft.pdf
  http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf
  http://www.inf.puc-rio.br/~roberto/docs/corosblp.pdf
  http://queue.acm.org/detail.cfm?id=1983083

[1] Counterintuitively, for some highly concurrent, high load, asynchronous
I/O network applications, smart but simple Lua 5.3 code could theoretically
outperform equivalent code running on V8 or the .NET runtime, despite not
being JIT'd and lacking other performance advantages. Neither an AOT nor JIT
compiler can easily or cheaply (if at all) inline a function across a resume
if it can't prove it won't immediately yield back. Likewise, a JIT compiler
can't easily or cheaply trace across a yield/resume, and probably wouldn't
bother, even though in many cases (like using coroutines to create
iterators from callbacks) tracing could produce results similar to those
for tight numeric loops. The trick would be to avoid doing complex string
operations in Lua. I have an idea, which I'll be adding to cqueues, for
abstracting typical string operations into a C module so that Lua code can
control the logic without the cost of juggling the strings in the Lua VM;
it stems from my attempts to refactor LPeg to support asynchronous I/O and
yielding.

[2] I've always assumed that the Lua developers were inspired by the Dis
Virtual Machine underlying Limbo, which sparked serious interest in
register-based virtual machines. This paper

  http://www.vitanuova.com/inferno/papers/dis.html

was published in 2000, 5 years before the 2005 paper describing the Lua 5.0
register-based virtual machine

  https://www.lua.org/doc/jucs05.pdf