proposed fork-join framework

Sat Dec 4 13:46:28 PST 2010

Hi Ed,

Just so it doesn't seem to everyone that I'm ignoring
your concerns, here's a summary that I think covers
the main points in your document:

ForkJoinPool is an ExecutorService (as are others
in java.util.concurrent, such as ThreadPoolExecutor).
It differs from others mainly by virtue of using
work-stealing, which is a decentralized scheduling
technique that best applies to directed acyclic
parallel computations. Work-stealing has evolved since
the 1990s, and is now used in many lightweight
task frameworks; including for example the custom
scheduler used in hotspot's parallel garbage collectors.

The main twist in FJ compared to most others is linking
work-stealing to the use of Futures (ForkJoinTasks),
which provides a more accessible and pleasant basis for
creating applications and layered libraries. (Although FJ
can also be used in other ways, for example with
completion callbacks, quiescence  monitoring, etc rather
the Future-based join operation).
The framework is more complicated internally than most
Executors because it must efficiently manage decentralized
state to support these diverse usages. For an academic/hacker,
this was the fun part.

The ForkJoin framework has comparable support for
errors and unexpected behavior as other ExecutorServices.
The main differences are that the specs are clearer
about the conditions under which programs can stall
(i.e., when they are stuck on IO or are not DAG-structured),
that we include some (evadable) nuisances (like
not allowing IOExecptions) to encourage appropriate use,
and we provide a means by which advanced users can further
intervene (via ManagedBlocker)

The performance of the framework is pretty good (and
improves over time). One nice property of (some, not all)
work-stealing based Fork/Join computations is that they
are have good locality and so their performance is not
very sensitive to the many ways that caches can be
arranged on multicores and multiprocessors.
Also, while I hope to avoid benchmark arguments,
I note that you cite only a comparison of two
completely different algorithms. The FJ demo uses
a variant the Blelloch parallel prefix algorithm, that
I find to outperform statically partitioned ones (as
I believe you used) only with more than 8 cores or so.

If you think I've missed any of your concerns, or you
(or anyone else) have other concerns, please feel free
to ask.

-Doug