accumulate locally and mutatively but combine concurrently (a use case)

Thu May 9 18:51:44 PDT 2013

How does a Collector represent the following computation:

1. Somebody passes out stream chunks to several processor threads.

2a. Each thread accumulates results into a mutable aggregate ArrayList<T>.
2b. (Or the aggregate could be something even cleverer that doesn't do log(N) copying of its values.)
2c. Each such ArrayList<T> is serially modified by thread-confined side effects.
2d. The JIT can understand the loop that does this.
2e. At the smallest scale, the writes into the arrays are performed by unrolled loops, even vectorized ones, with minimal per-element overhead.

3a. Each thread produces a finished ArrayList<T> aggregate of results for roll-up.
3b. Each thread "freezes" its result aggregate before handoff.  The resulting frozen aggregate is (1) thread-safe and (2) immutable.
3c. (For some mutable aggregates, the freezing step could also include operations like right-sizing or balancing.)
3d. The freezing operation includes a suitable memory fence operation to settle all side effects to the aggregate.

4. The guy(s) that passed out stream chunks waits for the resulting aggregate and combines them, using O(log N) pointer pasting operations.

It seems to me that step 2 is well represented by a STRICTLY_MUTATIVE but not CONCURRENT Collector.

And the other steps are well represented by a CONCURRENT and possibly UNORDERED one.

But running step 2 with a CONCURRENT Collector would be bad, since even if each thread had its own accumulator, the accumulator would have to provide for safely concurrent access.  The JVM can improve the computation with such things using biased locking and lock coarsening, but they will never be fast like a normal loop into a local array.

Do we need a collector mode which says, "I accumulate locally and mutatively but I combine concurrently"?

Even more:  Should the transition from mutative to immutable be represented by a type shift?  In Collector<T,M,C>, M is the mutable aggregate, and each thread would have an export step that freeze its M into a C value for subsequent combination (roll-up) across threads:

    BiFunction<M, T, M> accumulator();
    Function<M, C> freezer();
    BinaryOperator<C> combiner();

(Note:  "freezer" is not a real proposal!)

Having the extra distinction of M vs. C will cause friction in other places, but perhaps it is an important distinction between larval and adult stages[1] of a data structure.  The distinction can be erased, but even then there still needs to be (in step 3b above) an explicit operation (a Consumer<R>, the final side effect!) which freezes a value of type R.

    BiFunction<R, T, R> accumulator();
    Consumer<R> freezer();
    BinaryOperator<R> combiner();

Given the choice I would prefer the type distinction.  Even better would be for someone to point out the right way to use the current Collector API to do steps 1-4 above.

– John

[1] https://blogs.oracle.com/jrose/entry/larval_objects_in_the_vm