Extending Collector to handle a post-transform

Tue May 28 20:44:55 PDT 2013

Really glad you added this ability to have an intermediate type for
collectors; it is something I find very useful.

In my collectors for my stream library I didn't hide the intermediate type,
since I found it useful to know what was under the covers, e.g. that a
StringJoiner was doing the work. However, if you did want to hide it to
reduce the surface area of the API, then I *think* you can do:

<disclaimer>Completely untested code</disclaimer>

Instead of the IDENTITY_TRANSFORM and CONCURRENT flags, use types. The
collect methods accept a Collector and use instanceof tests to select
optimisations. Hence the intermediate types are not exposed in the Stream
API.

  interface Collector<T, R> {
    Supplier<R> resultSupplier();    // make a new container
    BiFunction<R, T, R> accumulator();   // incorporate one new value
    // set of UNORDERED and/or STRICTLY_MUTATIVE
    default Set<Characteristics> characteristics() { return Characteristics.EMPTY_SET; }
  }

  interface ConcurrentCollector<T, R> extends Collector<T, R> {
    BinaryOperator<R> combiner();   // combine two containers
  }

  interface TransformingCollector<T, I, R> extends Collector<T, I> {
    // transform from intermediate to result type; default is an unchecked identity cast
    @SuppressWarnings("unchecked")
    default Function<I, R> transformer() { return i -> (R) i; }
  }

  interface TransformingConcurrentCollector<T, I, R>
      extends ConcurrentCollector<T, I>, TransformingCollector<T, I, R> {}
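
To make the instance test a bit more concrete, here is a minimal sketch of
what I have in mind, for the combiner case only. The Sketch* names and the
toList() helper are invented for illustration, the two-way split merely
stands in for a real parallel evaluation, and none of this is meant to be
the actual Stream implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

// Hypothetical cut-down copies of the interfaces above (characteristics omitted).
interface SketchCollector<T, R> {
    Supplier<R> resultSupplier();          // make a new container
    BiFunction<R, T, R> accumulator();     // incorporate one new value
}

interface SketchConcurrentCollector<T, R> extends SketchCollector<T, R> {
    BinaryOperator<R> combiner();          // combine two containers
}

final class SketchCollect {
    // The signature mentions only the base type; the subtype is found at run time.
    public static <T, R> R collect(List<T> source, SketchCollector<T, R> collector) {
        if (collector instanceof SketchConcurrentCollector && source.size() > 1) {
            SketchConcurrentCollector<T, R> cc = (SketchConcurrentCollector<T, R>) collector;
            int mid = source.size() / 2;
            // Two halves folded separately stand in for a real parallel split.
            R left = fold(source.subList(0, mid), cc);
            R right = fold(source.subList(mid, source.size()), cc);
            return cc.combiner().apply(left, right);
        }
        return fold(source, collector);    // plain sequential path
    }

    private static <T, R> R fold(List<T> part, SketchCollector<T, R> c) {
        R container = c.resultSupplier().get();
        BiFunction<R, T, R> acc = c.accumulator();
        for (T t : part) {
            container = acc.apply(container, t);
        }
        return container;
    }

    // Example collector that has a combiner: gathers elements into an ArrayList.
    public static <T> SketchConcurrentCollector<T, List<T>> toList() {
        return new SketchConcurrentCollector<T, List<T>>() {
            public Supplier<List<T>> resultSupplier() { return ArrayList::new; }
            public BiFunction<List<T>, T, List<T>> accumulator() {
                return (list, t) -> { list.add(t); return list; };
            }
            public BinaryOperator<List<T>> combiner() {
                return (a, b) -> { a.addAll(b); return a; };
            }
        };
    }
}
```

A collector that only implements the base interface falls through to the
sequential path, so the instance test stays a pure optimisation.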

As I said at the start, a very welcome addition to the API.

Thanks,

 -- Howard.

On 29 May 2013 07:59, Brian Goetz <brian.goetz at oracle.com> wrote:

> Recall that it was a frequently-requested feature during the development
> of Collector to support an optional post-transform function, decoupling the
> intermediate accumulation state from the final result.  At the time, I took
> a swing at implementing this, but all the options that made sense then
> added complexity or cost.  Since then, there have been some
> refinements to the Collector API that have made this more practical.  This
> message is in two parts: this one is about how to extend Collector, and the
> next about how this might affect the standard set of Collectors.
>
> Currently Collector looks like:
>
> interface Collector<T, R> {
>
>     Supplier<R> resultSupplier();    // make a new container
>     BiFunction<R, T, R> accumulator();   // incorporate one new value
>     BinaryOperator<R> combiner();   // combine two containers
>     Set<Characteristics> characteristics();
>
> }
>
> where the characteristics are drawn from an enum { CONCURRENT, UNORDERED,
> and STRICTLY_MUTATIVE }.  Each of these is a pure optimization that enables
> frameworks to take advantage of known properties of the collector; if a
> framework ignores the characteristics, it should still be able to arrive at
> the correct result.
>
> The proposed post-transform function would take the final result (after
> accumulation for serial operation, after combination for parallel
> operation), and apply a final transform function to it.  Motivations for
> this feature include:
>  - Using a different type for accumulation and result.  For example, use
> a StringBuilder to accumulate but then return a String when done.
>  - Allowing the Collector to impose invariants on the result that may not
> be efficiently maintainable by the accumulator function.
>  - Enabling a Collector to return an immutable result even though mutation
> is integral to what collect() does.
>
> Adding a post-function is entirely straightforward, but there are a few
> disadvantages.  The first is that Collector acquires an extra type
> parameter; instead of input/output types, there is a third type for the
> intermediate type.  This adds somewhat to the API surface area.
>
> The other is a performance concern; for combinators like groupingBy(f,
> collector), we have a choice of ways to implement the post-function for
> each value of the resulting map, but none is perfect.  One option is to
> update the elements in-place with Map.replaceAll(); the other is to return
> a "view" map.  The former is O(n) in the number of map keys; the latter
> defers a potentially significant fraction of the collect() work until after
> the user thinks the collect() is finished (and may lead to redundant work
> if we don't cache the results of applying the post-function to a specific
> bucket).  If the post function is the identity function, this is even
> worse; we're doing potentially a lot of work for a no-op.
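
For concreteness, the in-place option might look roughly like the sketch
below. PostTransformSketch and finishInPlace are names I made up, and the
view-map alternative is not shown. The unchecked casts are the interesting
part: the same map instance first holds values of the intermediate type I
and afterwards values of the result type R, which only works thanks to
erasure.

```java
import java.util.Map;
import java.util.function.Function;

final class PostTransformSketch {
    // Rewrites every bucket with the post-function in one pass over the map,
    // then hands the very same map back under its result-typed view.
    @SuppressWarnings("unchecked")
    public static <K, I, R> Map<K, R> finishInPlace(Map<K, I> buckets,
                                                    Function<I, R> transformer) {
        // Unchecked: treat the values as plain Objects while they change type.
        Map<K, Object> raw = (Map<K, Object>) (Map<K, ?>) buckets;
        raw.replaceAll((key, value) -> transformer.apply((I) value));
        // Unchecked: every value is now an R, so expose the map as Map<K, R>.
        return (Map<K, R>) (Map<K, ?>) raw;
    }
}
```

Calling finishInPlace(buckets, StringBuilder::toString) on a
Map<String, StringBuilder> returns the same map object, now typed as
Map<String, String>. A caller who knows the transform is the identity
could skip the pass entirely.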
>
> The addition of the characteristics allows us to identify explicitly when
> the post-transform is a no-op, by having a characteristic flag for that.
>
> So Collector becomes:
>
> interface Collector<T, I, R> {
>
>     Supplier<I> resultSupplier();    // make a new container
>     BiFunction<I, T, I> accumulator();   // incorporate one new value
>     BinaryOperator<I> combiner();   // combine two containers
>     Function<I, R> transformer();
>     Set<Characteristics> characteristics();
>
> }
>
> and the characteristic enum acquires IDENTITY_TRANSFORM.  This means that
> the cost can be completely eliminated if the feature is not used.
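
A rough sketch of how a framework could honour that flag, assuming the
shape above. Collector3, Joining, ToList and Collect3 are hypothetical
names of mine (Collector3 stands in for the proposed interface and is not
the real java.util.stream type), and the driver is sequential only, so the
combiner is never called here.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for the proposed three-parameter Collector.
interface Collector3<T, I, R> {
    enum Characteristics { CONCURRENT, UNORDERED, STRICTLY_MUTATIVE, IDENTITY_TRANSFORM }

    Supplier<I> resultSupplier();          // make a new container
    BiFunction<I, T, I> accumulator();     // incorporate one new value
    BinaryOperator<I> combiner();          // combine two containers
    Function<I, R> transformer();          // intermediate type to result type
    Set<Characteristics> characteristics();
}

// Accumulates in a StringBuilder but returns a String.
final class Joining implements Collector3<CharSequence, StringBuilder, String> {
    public Supplier<StringBuilder> resultSupplier() { return StringBuilder::new; }
    public BiFunction<StringBuilder, CharSequence, StringBuilder> accumulator() {
        return (sb, s) -> sb.append(s);
    }
    public BinaryOperator<StringBuilder> combiner() { return (a, b) -> a.append(b); }
    public Function<StringBuilder, String> transformer() { return StringBuilder::toString; }
    public Set<Collector3.Characteristics> characteristics() {
        return EnumSet.noneOf(Collector3.Characteristics.class);
    }
}

// Intermediate and result types coincide, so the transform is flagged as a no-op.
final class ToList<T> implements Collector3<T, List<T>, List<T>> {
    public Supplier<List<T>> resultSupplier() { return ArrayList::new; }
    public BiFunction<List<T>, T, List<T>> accumulator() {
        return (list, t) -> { list.add(t); return list; };
    }
    public BinaryOperator<List<T>> combiner() { return (a, b) -> { a.addAll(b); return a; }; }
    public Function<List<T>, List<T>> transformer() { return Function.identity(); }
    public Set<Collector3.Characteristics> characteristics() {
        return EnumSet.of(Collector3.Characteristics.IDENTITY_TRANSFORM);
    }
}

final class Collect3 {
    @SuppressWarnings("unchecked")
    public static <T, I, R> R collect(Iterable<? extends T> source, Collector3<T, I, R> c) {
        I container = c.resultSupplier().get();
        BiFunction<I, T, I> acc = c.accumulator();
        for (T t : source) {
            container = acc.apply(container, t);
        }
        if (c.characteristics().contains(Collector3.Characteristics.IDENTITY_TRANSFORM)) {
            return (R) container;          // unchecked: the flag promises I and R coincide
        }
        return c.transformer().apply(container);
    }
}
```

With Joining the transformer runs once at the end; with ToList it is never
invoked. A combinator like groupingBy could make the same check once and
skip the per-bucket pass.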
>
> What's bad?
>
>  - More generics in Collector signatures.  Collectors that don't want
> to export their intermediate type are declared as Collector<T, ?, R>,
> which users may find disturbing. (The obvious attempts to make the extra
> type arg go away don't work.)
>  - Reliance on erasure.  For collectors like groupingBy() that take a
> Supplier<M>, we either need to take two suppliers (one for Map<K, I> and
> the other for Map<K, R>) or explicitly spec that the Map will be used to
> contain values of either I or R.  While this is not actually a problem for
> all the Map implementations in the JDK, it is kind of smelly.  (Don't
> bother raising the issue "and it won't work with reification"; the set of
> things that already don't is so large that ten more won't make it worse.)
>  This only shows up in the few Collector forms that take an explicit
> supplier argument; it is a pure implementation detail for the rest.
>
> Overall I think this is a reasonable price to pay for making the
> abstraction more powerful.