Re: PATCH: true C coroutines -- yield across C stack from anywhere

lua-l archive

Subject: Re: PATCH: true C coroutines -- yield across C stack from anywhere
From: Mike Pall <mikelu-0410@...>
Date: Fri, 22 Oct 2004 18:09:34 +0200

Hi,

Roberto Ierusalimschy wrote:
> Your point of making the use of a C stack optional is quite important,
> so that users still have coroutines in systems not supported by
> C_CORO. My question is, once the system supports C_CORO, what is the
> point of making the C stack optional? In other words, why offer this
> option to the programmer?  Do you think it can be expensive to create
> "true" coroutines?

Well, it does come with a price. It is however unnoticeable unless you
create thousands of coroutines. But for this category of problems (google
for 'c10k') you get into trouble when allocating a stack for each coroutine.
Unfortunately these are exactly the kind of problems that would benefit
most from linear control flow in C code enabled by 'true' C coroutines. :-|

Here are a few data points:

1. The pure creation (no call) of coroutines is 3 times slower for
   coroutines with a minimal C stack (4 KB) than without one.
   [To be fair: the time is dominated by the kernel memory management.]

   On my machine a single coroutine creation takes ~8 microseconds
   (without C stacks) vs. ~24 microseconds (with C stacks).
   [This includes Lua VM overhead since the test was written in Lua].

2. Switching back and forth between two (otherwise empty) coroutines
   with C stacks is ~25% slower than for coroutines without C stacks.
   The non-unwinding lua_cyield() is only slightly faster than the
   unwinding lua_yield(), because the overhead is dominated by C stack
   switching (this could be optimized a bit more, though).

   On my machine a single roundtrip coroutine switch (resume + yield)
   takes ~0.8 microseconds (without C stacks) vs. ~1 microsecond
   (with C stacks).
   [This includes Lua VM overhead since the test was written in Lua].

3. Sizing the C stacks is hard. You can be wasteful and ignore the issue
   when you have just a few coroutines. But (say) 256 KB each may be too
   much for some embedded environments. If you have tons of coroutines
   you have to be really careful (10000 coroutines at 4 KB = ~40 MB,
   at 256 KB = ~2.5 GB). Virtual memory may help you here, though
   (since this works well for C stacks under native threading).

   [There are tons of postings from people that are unhappy with the
    default stack size settings for some native threading packages.
    In both directions, I should mention.]
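
The memory figures in point 3 can be reproduced with a few lines of C. This is only a sketch of the arithmetic. The helper name total_stack_bytes is made up for illustration and is not part of the patch, and it counts only the C stacks themselves (no per-coroutine Lua state, no allocator overhead).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): total bytes of C stack needed
 * when every coroutine gets its own stack of stack_kb kilobytes.
 * 64-bit arithmetic is used because the 256 KB case exceeds 2^31 bytes. */
static uint64_t total_stack_bytes(uint64_t coroutines, uint64_t stack_kb)
{
    return coroutines * stack_kb * 1024u;
}

/* Convert a byte count to MiB for easier reading. */
static double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

For 10000 coroutines, total_stack_bytes(10000, 4) gives 40,960,000 bytes (about 39 MiB) and total_stack_bytes(10000, 256) gives 2,621,440,000 bytes (about 2.4 GiB), which matches the ~40 MB and ~2.5 GB quoted above.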

Note: This is not a comparison of the patched vs. unpatched Lua core itself.
      Not using the new feature (e.g. C stack size = 0) does not show any
      noticeable slow-down compared to the unpatched Lua core.

The numbers sound really bad at first, but (BIG BUT):

You may ignore problem #1 if you have a fairly fixed set of coroutines,
recycle them, or do not have a high coroutine creation rate (true for
many cases).
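
One way to picture the recycling idea is a free list of C stacks, so that a finished coroutine hands its stack to the next one instead of going back to the allocator. The following is a minimal sketch under assumptions: the names stack_pool, stack_pool_get and stack_pool_put are hypothetical, malloc() stands in for whatever allocation the patch really uses, and the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_SIZE (4 * 1024)   /* the minimal 4 KB stack from point 1 */

/* A stack that is not in use stores the free-list link in its first bytes. */
typedef struct free_stack {
    struct free_stack *next;
} free_stack;

typedef struct stack_pool {
    free_stack *free_list;   /* stacks waiting to be reused */
    size_t allocated;        /* number of real malloc() calls so far */
} stack_pool;

/* Take a stack from the pool; allocate only when the pool is empty. */
static void *stack_pool_get(stack_pool *p)
{
    if (p->free_list != NULL) {
        free_stack *s = p->free_list;
        p->free_list = s->next;
        return s;
    }
    p->allocated++;
    return malloc(STACK_SIZE);
}

/* Hand a stack back once its coroutine is dead. */
static void stack_pool_put(stack_pool *p, void *stack)
{
    free_stack *s = stack;
    assert(stack != NULL);
    s->next = p->free_list;
    p->free_list = s;
}

/* Release every pooled stack, e.g. at shutdown. */
static void stack_pool_destroy(stack_pool *p)
{
    while (p->free_list != NULL) {
        free_stack *s = p->free_list;
        p->free_list = s->next;
        free(s);
    }
}
```

With a zero-initialized stack_pool, a get/put/get sequence returns the same pointer twice and calls malloc() only once, which is the whole point: the expensive part of creation is paid once per pooled stack rather than once per coroutine.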

You may ignore problem #2 if the main overhead is not dominated by
coroutine switching (true for any non-benchmarking use of coroutines).

However, problem #3 is hard to ignore except in trivial cases.

Ok, other languages have bitten the bullet and 'just do it':

E.g. Ruby supports true C coroutines in the core VM loop (with C stack
copying and not C stack switching -- a tad slower, but has less memory
overhead). The scheduler in the Ruby VM is awful, though. Stackless Python
has similar concepts for when the C stack cannot be unfolded easily
(standard Python (CPython) still does not have true coroutines).
Edgar mentioned 'Io' which uses the same C stack switching code. I guess
we will see more languages with support for C coroutines in the future
(e.g. this is planned for perl6).
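
To illustrate what C stack switching means in general, here is a small sketch using the POSIX ucontext functions (getcontext, makecontext, swapcontext). It is not the code from the patch and makes no claim about how the patch, Ruby or Io implement it. The coro type, the coro_* functions and the 64 KB stack size are assumptions for the example, which is single-threaded and skips error handling. The point is that coro_yield() is called from a nested C function and the C call chain survives the switch.

```c
#define _XOPEN_SOURCE 700   /* some platforms require this for ucontext */
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define CORO_STACK_SIZE (64 * 1024)   /* assumed size, see point 3 */

typedef struct coro {
    ucontext_t caller;   /* context of whoever called coro_resume() */
    ucontext_t self;     /* context running on the private C stack */
    void *stack;         /* the private C stack */
    int value;           /* last value passed to coro_yield() */
    int done;            /* set when the body has returned */
} coro;

static coro *current;    /* the coroutine that is currently running */

/* Suspend the running coroutine; no unwinding, the C frames stay put. */
static void coro_yield(int v)
{
    coro *c = current;
    c->value = v;
    swapcontext(&c->self, &c->caller);
}

/* A nested C function that yields from one level down the C stack. */
static void nested_helper(int base)
{
    coro_yield(base + 1);
    coro_yield(base + 2);
}

static void coro_body(void)
{
    nested_helper(10);
    nested_helper(20);
    current->done = 1;   /* returning continues at uc_link (the caller) */
}

static coro *coro_new(void)
{
    coro *c = calloc(1, sizeof *c);
    assert(c != NULL);
    c->stack = malloc(CORO_STACK_SIZE);
    assert(c->stack != NULL);
    getcontext(&c->self);
    c->self.uc_stack.ss_sp = c->stack;
    c->self.uc_stack.ss_size = CORO_STACK_SIZE;
    c->self.uc_link = &c->caller;
    makecontext(&c->self, coro_body, 0);
    return c;
}

/* Run the coroutine until its next yield. Returns 1 if it yielded
 * (result in c->value) and 0 if the body has finished. */
static int coro_resume(coro *c)
{
    current = c;
    swapcontext(&c->caller, &c->self);
    return !c->done;
}

static void coro_free(coro *c)
{
    free(c->stack);
    free(c);
}
```

Resuming a coroutine made with coro_new() four times delivers 11, 12, 21 and 22 in c->value, and the fifth coro_resume() returns 0. Each coroutine owns CORO_STACK_SIZE bytes for its whole lifetime, which is exactly the memory cost that stack copying avoids at the price of copying on every switch.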

Another open question is: How to decide which coroutines get a C stack and
which don't (and how large should the stack be)? A semi-automatic solution
is only available with C stack copying and not with C stack switching
(I may be wrong on this though). But this would require a fixed C stack
below lua_resume(), which is against the spirit of the Lua C API (not
enforcing a certain scheduler). Right now I chickened out and left the
decision up to the application programmer (which may not be such a bad
thing after all).

To summarize: I think it is important to keep the benefits of lightweight
coroutines (without C stacks) even for platforms where a better solution
(full C coroutines) is available. This would be a distinctive advantage
for Lua compared to other language implementations, too.

On the matter of how to make that code pluggable for the Lua core:

This is a pretty difficult issue. One would either need a dozen defines that
patch very specific code snippets at just the right points (which is not
a maintainable solution) or find a more powerful API abstraction to the
whole coroutine system (sorry, I have no suggestion for this (yet)).

I think C coroutines are such a 'basic' feature that they really should
be supported by the core VM of any language. With my solution for Lua they
are at least not mandatory and the non-portable code is all in one file.

BTW: The lua_lock/lua_unlock precedent is a bad one. I played with it once
and basically gave up. There was just no elegant way to integrate more
complex stuff without adding all sorts of new hooks in various places
of the Lua core. The other problem was that these macros are a real
performance killer due to the way they are used at every Lua/C API boundary.
Anyway, I gave up on doing native threading within the same Lua universe
(but it has its place when each thread has its own Lua universe).

Bye,
     Mike

References:
- PATCH: true C coroutines -- yield across C stack from anywhere, Mike Pall
- Re: PATCH: true C coroutines -- yield across C stack from anywhere, Roberto Ierusalimschy
