Extending Collector to handle a post-transform
Howard Lovatt
howard.lovatt at gmail.com
Tue May 28 20:44:55 PDT 2013
Really glad you added this ability to have an intermediate type for
collectors, it is something I find very useful.
In my collectors for my stream library I didn't hide the intermediate type
since I found it useful to know what was under the covers, e.g. a
StringJoiner was doing the work. However if you did want to hide it to
reduce the surface area of the API then I *think* you can do:
<disclaimer>Completely untested code</disclaimer>
Instead of IDENTITY_TRANSFORM and CONCURRENT flags use types. The collect
methods accept a Collector and instance test for optimisations. Hence the
intermediate types are not exposed in the Stream API.
interface Collector<T, R> {
Supplier<R> resultSupplier(); // make a new container
BiFunction<R, T, R> accumulator(); // incorporate one new value }
default Set<Characteristics> characteristics() { return
Characteristics.EMPTY_SET; } // set of UNORDERED and/or STRICTLY_MUTATIVE
}
interface ConcurrentCollector<T, R> extends Collector<T, R> {
BinaryOperator<R> combiner(); // combine two containers
}
interface TransformingCollector<T, I, R> extends Collector<T, I> {
default Function<I, R> transformer(); { (i) -> (R)i; } // transform
from intermediate to result type
}
interface TransformingConcurrentCollector<T, I, R> extends
ConcurrentCollector<T, I>, TransformingCollector<T, I, R> {}
As I said at the start, a very welcome addition to the API.
Thanks,
-- Howard.
On 29 May 2013 07:59, Brian Goetz <brian.goetz at oracle.com> wrote:
> Recall that it was a frequently-requested feature during the development
> of Collector to support an optional post-transform function, decoupling the
> intermediate accumulation state from the final result. At the time, I took
> a swing at implementing this but all the options that made sense at the
> time added complexity or cost, but since then there have been some
> refinements to the Collector API that have made this more practical. This
> message is in two parts: this one is about how to extend Collector, and the
> next about how this might affect the standard set of Collectors.
>
> Currently Collector looks like:
>
> interface Collector<T, R> {
>
> Supplier<R> resultSupplier(); // make a new container
> BiFunction<R, T, R> accumulator(); // incorporate one new value
> BinaryOperator<R> combiner(); // combine two containers
> Set<Characteristics> characteristics();
>
> }
>
> where the characteristics are drawn from an enum { CONCURRENT, UNORDERED,
> and STRICTLY_MUTATIVE }. Each of these are pure optimizations that enable
> frameworks to to take advantage of known properties of the collector; if a
> framework ignores the characteristics, it should still be able to arrive at
> the correct result.
>
> The proposed post-transform function would take the final result (after
> accumulation for serial operation, after combination for parallel
> operation), and apply a final transform function to it. Motivation for
> this feature include:
> - Use of a different type for accumulation and result. For example, use
> a StringBuilder to accumulate but then return a String when done.
> - Allowing the Collector to impose invariants on the result that may not
> be efficiently maintainable by the accumulator function.
> - Enable a Collector to return an immutable result even though mutation
> is integral to what collect() does.
>
> Adding a post-function is entirely straightforward, but there are a few
> disadvantages. The first is that Collector acquires an extra type
> parameter; instead of input/output types, there is a third type for the
> intermediate type. This adds somewhat to the API surface area.
>
> The other is a performance concern; for combinators like groupingBy(f,
> collector), we have a choice of ways to implement the post-function for
> each value of the resulting map, but none is perfect. One option is to
> update the elements in-place with Map.replaceAll(); the other is to return
> a "view" map. The former is O(n) in the number of map keys; the latter
> defers a potentially significant fraction of the collect() work until after
> the user thinks the collect() is finished (and may lead to redundant work
> if we don't cache the results of applying the post-function to a specific
> bucket.) If the post function is the identity function, this is even
> worse; we're doing potentially a lot of work for a no-op.
>
> The addition of the characteristics allows us to identify explicitly when
> the post-transform is a no-op; have a characteristic flag for that.
>
> So Collector becomes:
>
> interface Collector<T, I, R> {
>
> Supplier<I> resultSupplier(); // make a new container
> BiFunction<I, T, I> accumulator(); // incorporate one new value
> BinaryOperator<I> combiner(); // combine two containers
> Function<I, R> transformer();
> Set<Characteristics> characteristics();
>
> }
>
> and the characteristic enum acquires IDENTITY_TRANSFORM. This means that
> the cost can be completely eliminated if the feature is not used.
>
> What's bad?
>
> - More generics in Collector signatures. For Collectors that don't want
> to export their intermediate type, they are declared as Collector<T, ?, R>,
> which users may find disturbing. (The obvious attempts to make the extra
> type arg go away don't work.)
> - Reliance on erasure. For collectors like groupingBy() that take a
> Supplier<M>, we either need to take two suppliers (one for Map<K, I> and
> the other for Map<K, R>) or explicitly spec that the Map will be used to
> contain values of either I or R. While this is not actually a problem for
> all the Map implementations in the JDK, it is kind of smelly. (Don't
> bother raising the issue "and it won't work with reification"; the set of
> things that already don't is so large that ten more won't make it worse.)
> This only shows up in the few Collector forms that take an explicit
> supplier argument; it is a pure implementation detail for the rest.
>
> Overall I think this is a reasonable price to pay for making the
> abstraction more powerful.
>
>
>
>
--
-- Howard.
More information about the lambda-libs-spec-observers
mailing list