accumulate locally and mutatively but combine concurrently (a use case)

Fri May 10 09:05:38 PDT 2013

A good way to think about Collector is that it is about constructing 
*builders*, not final results.  It turns out that many of the builders 
we might compute (lists and maps) are also just fine as final results.

> 2a. Each thread accumulates results into a mutable aggregate ArrayList<T>.
> 2b. (Or the aggregate could be something even cleverer that doesn't do log(N) copying of its values.)

This is our "SpinedBuffer" implementation class.

> 2c. Each such ArrayList<T> is serially modified by thread-confined side effects.
> 2d. The JIT can understand the loop that does this.
> 2e. At the smallest scale, the writes into the arrays are performed by unrolled loops, even vectorized ones, with minimal per-element overhead.

All of the (2) requirements are (or should be) handled by the framework 
requirements outlined in Collector.

> 3a. Each thread produces a finished ArrayList<T> aggregate of results for roll-up.
> 3b. Each thread "freezes" its result aggregate before handoff.  The resulting frozen aggregate is (1) thread-safe and (2) immutable.

So, it's 3(b)(2) that we don't support.  This was a deliberate choice, 
but maybe there are considerations that were missed at the time.

The basic idea is that you only want to do the finishing step at the 
very last minute.  Let's say your mutable representation is an "unboxed" 
form (e.g., a mutable counter) and your final representation is a boxed 
form.  There's no value in boxing the leaf results; this is all wasted 
effort.  Better to combine the counters as we go up the tree, and box at 
the end.

And here, I think, is where our assumptions might part ways.  I'm 
thinking "fork join", and you're thinking "gpu".  The "combine the 
counters" step is trivially easy to do in FJ, but involves much more 
complexity when mapping to a GPU substrate.

The Collector API offers a choice here -- combine the containers mutably 
(dump the right into the left) or create a new combined container that 
might just be a shallow copy.  But that choice belongs to the Collector 
writer (do I do leftList.addAll(rightList), or create a new view of 
concat(left, right)), not the framework implementor.

> 3c. (For some mutable aggregates, the freezing step could also include operations like right-sizing or balancing.)
> 3d. The freezing operation includes a suitable memory fence operation to settle all side effects to the aggregate.

Basically, Collector is more "stripped down" that this; its a 
straightforward extension of fold to mutable builders.  We did look into 
adding additional hook points, but its not clear what the right ones 
are.  The one everyone wants is a transform that is applied to the root 
result (StringBuilder::toString), which is obvious and desirable but 
painful with cascaded reductions like group-by, and in the end seemed to 
be more complexity than it was worth.

But, a per-leaf transform which would handle (3), and whose default was 
a no-op, might be perfectly tractible.  Again, though, the choices here 
accrue to the Collector writer, not the reduction implementor.

> 4. The guy(s) that passed out stream chunks waits for the resulting aggregate and combines them, using O(log N) pointer pasting operations.

This choice is distributed among log n "combiner" invocations.

> It seems to me that step 2 is well represented by a STRICTLY_MUTATIVE but not CONCURRENT Collector.

The concurrent case is never going to map well to a GPU; this is the 
case where we have a data structure for which concurrent inserts are 
cheaper than merging (which is true for many map operations, which again 
are not going to map well to GPUs.)

> Do we need a collector mode which says, "I accumulate locally and mutatively but I combine concurrently"?

When you say "combine concurrently", do you really mean "combine 
functionally"?