Coro in Java 8

Thu Apr 28 07:33:15 PDT 2011

On Thu, Apr 28, 2011 at 5:03 AM, Lukas Stadler <lukas.stadler at jku.at> wrote:
> I totally agree as well, naturally :-)
>
> And I do plan on spending more time on the coroutine patch again, I
> didn't do much coding in the last months because I was occupied at
> university.
> The result of this is my thesis on coroutines:
> http://ssw.jku.at/Research/Papers/Stadler11Master/

How would you feel about leading a JSR? :)

Seriously though, that may be the most important thing we need to do
to get this ball rolling. I could act as interim JSR lead, but this is
obviously not my area of expertise (and I don't know much about the
JSR process). Eventually someone who knows the implications of JVM
coroutines and/or the vagaries of the JSR process would want to lead
it.

> The coroutines in the current state actually provide full stack
> reification and can simulate continuations, albeit with suboptimal
> performance.
> But this is confined to two concrete methods (serialize/deserialize),
> and can thus easily be restricted with permissions.

Man, you need to get that patch working so I can play with it again :)
JRuby 1.7 is going to be my "experimenting with Java 7 and beyond"
release, so I'm keen to get back into coro experimentation (and tailc,
if that patch could get updated too).

> I would be really interested to hear what you guys think are the most
> important characteristics for coroutines... since an implementation is
> always a tradeoff:
> * fast creation
> * fast 1st activation (the first time a coroutine gets to run)
> * fast switching
> * fast migration from thread to thread
> * many coroutines (~100)
> * lots of coroutines (~ 10000)
> * lots and lots of coroutines (~ 1000000)

tl;dr: lots of coroutines and fast switching

For me, the pain of non-coro coroutines (i.e. using Threads) is in two places:

1. The cost of spinning up and managing Thread-based coroutines (and
coping with the inevitable GC rooting that happens)
2. The cost of cross-thread signaling and context-switching

For 1: coro is already *vastly* cheaper than spinning up threads, so
you've got that problem nailed. I don't think I've ever asked about GC
rooting...if I walk away from a live-but-not-running coroutine, does
it GC? Currently JRuby's Enumerator#next logic (which uses
Thread-based coroutines) has to do a complicated
WeakReference/finalization dance to safely shut down the coroutine
thread when the Enumerator object goes away. Fibers don't even do that
much; if a Fiber doesn't complete, it will live forever. Ouch. I think
coroutines must be GCable and not act as a GC root.
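
To illustrate the kind of cleanup a Thread-based coroutine needs, here
is a minimal sketch, not JRuby's actual code. It swaps in
java.lang.ref.Cleaner (Java 9+) for the WeakReference/finalization
approach described above; the class name AbandonableFiber and its
methods are hypothetical. The key point is that neither the worker's
Runnable nor the cleanup action may capture the handle, otherwise the
parked thread keeps the handle reachable and it never gets collected.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical handle owning a parked worker thread (illustrative only). */
final class AbandonableFiber implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    private final Thread worker;
    private final Cleaner.Cleanable cleanable;

    AbandonableFiber() {
        // The Runnable captures nothing, so the thread does not root this handle.
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();          // stands in for "waiting to be resumed"
            }
        });
        t.setDaemon(true);
        t.start();
        worker = t;
        // The cleanup action captures only the thread, never `this`.
        // It runs when the handle becomes unreachable, or on close().
        cleanable = CLEANER.register(this, t::interrupt);
    }

    /** Waits up to millis for the worker to exit; returns true if it has exited. */
    boolean awaitWorkerExit(long millis) {
        try {
            worker.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    /** Runs the cleanup action immediately instead of waiting for GC. */
    @Override
    public void close() {
        cleanable.clean();
    }
}
```

With a VM-level coroutine that is not a GC root, none of this
bookkeeping would be needed: dropping the last reference would be
enough.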

It's also already possible to spin up *way* more coroutines than
Threads. I have heard tales of people using tens or hundreds of
thousands of Fibers in Ruby, but it's uncommon; usually we're talking
about one or a handful of Fibers per web request (for a Fiber-driven
server, say). Given that, the ability to do "lots and lots" probably
should be there, but I'm not sure even in the weirdest cases that it
will be common.

On the other hand, I talked with Jim Baker (Jython) about the
potential for using "lots and lots" of coroutines, one per
(language-level, not Java-level) method invocation, to do a poor-man's
stackless language implementation. In that case we'd need as many
coroutines as there are active method invocations on the stack, which
could easily be tens of thousands. This is, again, a rather weird and
specialized case.

For 2: As you've seen in your benchmarks comparing JRuby's
Thread-based Fiber with your coro-based Fiber, the signaling and
switching mechanism we use currently is *very* slow. The best I've
come up with is currently in JRuby, and uses two SynchronousQueue
objects (input and output) to signal and communicate. I had a separate
implementation that used a single SynchronousQueue and it was faster,
but it failed to terminate gracefully (there was always either a
sender or receiver waiting on the queue at the end).
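
As a rough picture of the two-queue handoff described above, here is a
minimal sketch. It is not JRuby's implementation; ThreadFiber, resume
and the DONE sentinel are hypothetical names, and it assumes yielded
values are non-null (SynchronousQueue rejects null). Every resume costs
two cross-thread handoffs, which is where the OS scheduler gets
involved.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.function.Consumer;

/** Hypothetical thread-backed fiber using one queue per direction. */
final class ThreadFiber<T> {
    private static final Object DONE = new Object();   // sentinel: body has finished

    // caller -> fiber: "resume" signals
    private final SynchronousQueue<Object> input = new SynchronousQueue<>();
    // fiber -> caller: yielded values, then DONE
    private final SynchronousQueue<Object> output = new SynchronousQueue<>();
    private boolean finished;

    /** The body receives an "emit" callback that yields a value to the caller. */
    ThreadFiber(Consumer<Consumer<T>> body) {
        Thread worker = new Thread(() -> {
            try {
                input.take();                           // park until the first resume
                body.accept(value -> {
                    try {
                        output.put(value);              // hand the value to the caller
                        input.take();                   // park until the next resume
                    } catch (InterruptedException e) {
                        throw new IllegalStateException(e);
                    }
                });
                output.put(DONE);                       // tell the caller we are finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Resumes the fiber; returns the next yielded value, or null once it has finished. */
    @SuppressWarnings("unchecked")
    T resume() {
        if (finished) {
            return null;
        }
        try {
            input.put(Boolean.TRUE);                    // wake the worker
            Object result = output.take();              // block until it yields or finishes
            if (result == DONE) {
                finished = true;
                return null;
            }
            return (T) result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

A fiber that runs to completion terminates cleanly here because the
worker's last action is a put that the caller takes. A fiber that is
abandoned midway leaves its worker parked on input.take() indefinitely,
and this sketch does not handle a body that throws.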

I suspect the overhead is largely due to depending on OS thread
scheduling to (hopefully) schedule the Fiber's thread.

coro, on the other hand, does its switch in-place, without any
dependency on OS-level thread-scheduling semantics. For my money, coro
is already so much faster than Thread-to-Thread communication that any
improvement you make would just be gravy.

Now for the other features you offered:

* fast creation

coro is already so much faster than Thread, I'm quite satisfied here
already. But if we get into "lots and lots" we'd definitely see the
impact of creation (presumably "lots and lots" would mean many
short-lived coroutines).

* fast 1st activation (the first time a coroutine gets to run)

How much of a difference are we talking in the best and worst cases?
From what I've seen, there are two use cases for Ruby Fibers: as
one-shot (or "few-shot") continuations to do cross-call flow control;
and as generators. The first case would see a lot of benefit from
faster 1st invocation; the second would see less.

* fast switching

Again, already so much faster than Thread-to-Thread that I'm happy.
And again, "lots and lots" implies there would be a lot of
context-switching too...

* fast migration from thread to thread

Currently Ruby's Fiber is not migratable across threads. In JRuby's
current impl, we could easily migrate Fibers across threads with no
additional cost (i.e. it would remain as slow as it is regardless of
migration). If it's possible to make migration across threads fast
without sacrificing any same-thread performance, I'd say do it. But I
see this as the *least* important feature on the list...same-thread
"microthreading" is going to be *far* more common.

Now this *would* play into the "functional concurrency" model I talked
about earlier...ideally we'd be able to pass coroutines to worker
threads to simulate an M:N threading model. That would allow "lots
and lots" of active calculations to be spread across a few threads,
rather than the current complicated gymnastics required to make
threads gracefully chew on a dataset. I picture an ideal future where
it's so cheap to spin up and run coroutines that you'd spin up one for
every (nontrivial) calculation you want to do in parallel, and then
dump them into an Executor to be run. Because they're coroutines, the
Executor could at will suspend any of them, or just run each to
completion before grabbing another. Call it the "coroutine as
pausable/resumable job" model.
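
The "pausable/resumable job" idea can be approximated today without VM
coroutines, as in the minimal sketch below. It is only an emulation:
ResumableJob, SlicedExecutor and SumJob are hypothetical names, and each
job has to save its own state between slices by hand, which is exactly
the work a real coroutine would do for you. After every slice the job
goes to the back of the queue, so any worker thread may pick it up next.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a suspended coroutine. */
interface ResumableJob {
    /** Runs until the next suspension point; returns true if work remains. */
    boolean runSlice();
}

final class SlicedExecutor {
    /** Runs every job to completion on a fixed pool, requeueing after each slice. */
    static void runAll(List<? extends ResumableJob> jobs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch remaining = new CountDownLatch(jobs.size());
        for (ResumableJob job : jobs) {
            schedule(pool, job, remaining);
        }
        try {
            remaining.await();                       // wait until every job has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
    }

    private static void schedule(ExecutorService pool, ResumableJob job,
                                 CountDownLatch remaining) {
        // A real scheduler would also handle a slice that throws.
        pool.execute(() -> {
            if (job.runSlice()) {
                schedule(pool, job, remaining);      // "suspend": back of the queue
            } else {
                remaining.countDown();               // job finished
            }
        });
    }
}

/** Example job: sums 1..limit, doing at most `chunk` additions per slice. */
final class SumJob implements ResumableJob {
    private final long limit;
    private final long chunk;
    private long next = 1;
    private long total;

    SumJob(long limit, long chunk) {
        this.limit = limit;
        this.chunk = chunk;
    }

    @Override
    public boolean runSlice() {
        long end = Math.min(limit, next + chunk - 1);
        for (; next <= end; next++) {
            total += next;
        }
        return next <= limit;
    }

    long total() {
        return total;
    }
}
```

For example, SlicedExecutor.runAll(List.of(new SumJob(10, 3), new
SumJob(100, 7)), 2) interleaves both sums across two worker threads.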

* many coroutines (~100)
* lots of coroutines (~ 10000)
* lots and lots of coroutines (~ 1000000)

The cases for "lots and lots" start to sound very domain-specific,
don't they? If supporting "lots and lots" degrades any other features,
it should be a configurable option, e.g. -XX:+LotsAndLotsOfCoroutines.

> I'll soon fork off a version that only supports lots of coroutines,
> because this makes a lot of things much easier (it doesn't require the
> whole copy to/from stack logic).
>
> And I also think that a jsr as a rallying point would help a lot - I'll
> see what it will take to create one.

I've been a poor Java contributor since I never got on the 292 JSR
after leaving Sun (I couldn't be on it while there), but I promise if
any other language/VM-related JSRs come up I will leap right on them.
And I officially offer my help in making a JSR happen (as much as my
time permits, of course).

- Charlie