accumulate locally and mutatively but combine concurrently (a use case)
Brian Goetz
brian.goetz at oracle.com
Fri May 10 09:05:38 PDT 2013
A good way to think about Collector is that it is about constructing
*builders*, not final results. It turns out that many of the builders
we might compute (lists and maps) are also just fine as final results.
> 2a. Each thread accumulates results into a mutable aggregate ArrayList<T>.
> 2b. (Or the aggregate could be something even cleverer that doesn't do log(N) copying of its values.)
This is our "SpinedBuffer" implementation class.
> 2c. Each such ArrayList<T> is serially modified by thread-confined side effects.
> 2d. The JIT can understand the loop that does this.
> 2e. At the smallest scale, the writes into the arrays are performed by unrolled loops, even vectorized ones, with minimal per-element overhead.
All of the (2) requirements are (or should be) handled by the framework
requirements outlined in Collector.
> 3a. Each thread produces a finished ArrayList<T> aggregate of results for roll-up.
> 3b. Each thread "freezes" its result aggregate before handoff. The resulting frozen aggregate is (1) thread-safe and (2) immutable.
So, it's 3(b)(2) that we don't support. This was a deliberate choice,
but maybe there are considerations that were missed at the time.
The basic idea is that you only want to do the finishing step at the
very last minute. Let's say your mutable representation is an "unboxed"
form (e.g., a mutable counter) and your final representation is a boxed
form. There's no value in boxing the leaf results; this is all wasted
effort. Better to combine the counters as we go up the tree, and box at
the end.
And here, I think, is where our assumptions might part ways. I'm
thinking "fork join", and you're thinking "gpu". The "combine the
counters" step is trivially easy to do in FJ, but involves much more
complexity when mapping to a GPU substrate.
The Collector API offers a choice here -- combine the containers mutably
(dump the right into the left) or create a new combined container that
might just be a shallow copy. But that choice belongs to the Collector
writer (do I do leftList.addAll(rightList), or create a new view of
concat(left, right)), not the framework implementor.
> 3c. (For some mutable aggregates, the freezing step could also include operations like right-sizing or balancing.)
> 3d. The freezing operation includes a suitable memory fence operation to settle all side effects to the aggregate.
Basically, Collector is more "stripped down" that this; its a
straightforward extension of fold to mutable builders. We did look into
adding additional hook points, but its not clear what the right ones
are. The one everyone wants is a transform that is applied to the root
result (StringBuilder::toString), which is obvious and desirable but
painful with cascaded reductions like group-by, and in the end seemed to
be more complexity than it was worth.
But, a per-leaf transform which would handle (3), and whose default was
a no-op, might be perfectly tractible. Again, though, the choices here
accrue to the Collector writer, not the reduction implementor.
> 4. The guy(s) that passed out stream chunks waits for the resulting aggregate and combines them, using O(log N) pointer pasting operations.
This choice is distributed among log n "combiner" invocations.
> It seems to me that step 2 is well represented by a STRICTLY_MUTATIVE but not CONCURRENT Collector.
The concurrent case is never going to map well to a GPU; this is the
case where we have a data structure for which concurrent inserts are
cheaper than merging (which is true for many map operations, which again
are not going to map well to GPUs.)
> Do we need a collector mode which says, "I accumulate locally and mutatively but I combine concurrently"?
When you say "combine concurrently", do you really mean "combine
functionally"?
More information about the lambda-dev
mailing list