Extending Collector to handle a post-transform

Tue May 28 14:59:48 PDT 2013

Recall that it was a frequently-requested feature during the development of Collector to support an optional post-transform function, decoupling the intermediate accumulation state from the final result.  At the time, I took a swing at implementing this but all the options that made sense at the time added complexity or cost, but since then there have been some refinements to the Collector API that have made this more practical.  This message is in two parts: this one is about how to extend Collector, and the next about how this might affect the standard set of Collectors.  

Currently Collector looks like:

interface Collector<T, R> { 

    Supplier<R> resultSupplier();    // make a new container
    BiFunction<R, T, R> accumulator();   // incorporate one new value
    BinaryOperator<R> combiner();   // combine two containers
    Set<Characteristics> characteristics();

}

where the characteristics are drawn from an enum { CONCURRENT, UNORDERED, and STRICTLY_MUTATIVE }.  Each of these are pure optimizations that enable frameworks to to take advantage of known properties of the collector; if a framework ignores the characteristics, it should still be able to arrive at the correct result.  

The proposed post-transform function would take the final result (after accumulation for serial operation, after combination for parallel operation), and apply a final transform function to it.  Motivation for this feature include: 
 - Use of a different type for accumulation and result.  For example, use a StringBuilder to accumulate but then return a String when done.  
 - Allowing the Collector to impose invariants on the result that may not be efficiently maintainable by the accumulator function. 
 - Enable a Collector to return an immutable result even though mutation is integral to what collect() does.  

Adding a post-function is entirely straightforward, but there are a few disadvantages.  The first is that Collector acquires an extra type parameter; instead of input/output types, there is a third type for the intermediate type.  This adds somewhat to the API surface area.  

The other is a performance concern; for combinators like groupingBy(f, collector), we have a choice of ways to implement the post-function for each value of the resulting map, but none is perfect.  One option is to update the elements in-place with Map.replaceAll(); the other is to return a "view" map.  The former is O(n) in the number of map keys; the latter defers a potentially significant fraction of the collect() work until after the user thinks the collect() is finished (and may lead to redundant work if we don't cache the results of applying the post-function to a specific bucket.)  If the post function is the identity function, this is even worse; we're doing potentially a lot of work for a no-op.  

The addition of the characteristics allows us to identify explicitly when the post-transform is a no-op; have a characteristic flag for that.  

So Collector becomes:

interface Collector<T, I, R> { 

    Supplier<I> resultSupplier();    // make a new container
    BiFunction<I, T, I> accumulator();   // incorporate one new value
    BinaryOperator<I> combiner();   // combine two containers
    Function<I, R> transformer(); 
    Set<Characteristics> characteristics();

}

and the characteristic enum acquires IDENTITY_TRANSFORM.  This means that the cost can be completely eliminated if the feature is not used.  

What's bad?  

 - More generics in Collector signatures.  For Collectors that don't want to export their intermediate type, they are declared as Collector<T, ?, R>, which users may find disturbing. (The obvious attempts to make the extra type arg go away don't work.)  
 - Reliance on erasure.  For collectors like groupingBy() that take a Supplier<M>, we either need to take two suppliers (one for Map<K, I> and the other for Map<K, R>) or explicitly spec that the Map will be used to contain values of either I or R.  While this is not actually a problem for all the Map implementations in the JDK, it is kind of smelly.  (Don't bother raising the issue "and it won't work with reification"; the set of things that already don't is so large that ten more won't make it worse.)  This only shows up in the few Collector forms that take an explicit supplier argument; it is a pure implementation detail for the rest.  

Overall I think this is a reasonable price to pay for making the abstraction more powerful.