Collectors -- finally!

Brian Goetz brian.goetz at oracle.com
Sat Mar 16 09:22:42 PDT 2013


I believe I have the last word on Collector.

Recall the overriding goal of this effort is to support *composability*, 
so that "intermediate collecting stages" like groupBy / partition / 
mapping can be combined with other collections/reductions to let the 
user easily mix and match, rather than providing limited ad-hoc 
reductions like "groupBy".

In retrospect, the central tension in the API, which in various versions 
showed up as API bloat and API complexity, was that we were treating 
cascaded functional reduction and cascaded mutable reduction 
differently, when both are really 95% the same thing.  In the first 
version, that meant 16 forms of grouping{By,Reduce}, half of which were 
due to reduction.  In last week's version, the same tension surfaced as 
the GroupingCollector being overloaded as both a Collector and a 
factory for more complex Collectors.

The answer, I believe, is to "detune" the Collector abstraction so that 
it can model either a functional or a mutable reduction.  This slightly 
increases the pain of implementing a Collector (but not really), and 
slightly-more-than-slightly increases the pain of implementing reduction 
atop a Collector -- but that's good because the only true client of 
Collector is the Streams implementation, and that's where the pain 
belongs.  By moving the pain to the core framework implementation, the 
user code gets simpler, there are fewer exposed concepts, and the 
explosion of combinations becomes more manageable.

Here's Collector now:

public interface Collector<T, R> {
     Supplier<R> resultSupplier();
     BiFunction<R, T, R> accumulator();
     BinaryOperator<R> combiner();
     default boolean isConcurrent() { return false; }
     default boolean isStable() { return false; }
}

API-wise, what's changed is that accumulator() now returns a BiFunction, 
not a BiConsumer, raising the possibility that the accumulation 
operation could replace the result container rather than just mutate 
it.  This opens the door to some more interesting Collector 
implementations, and makes it more parallel with the combiner, which we 
turned into a BiFunction in the first round of Collector.  The other new 
method is "isStable" (better name invited), which merely indicates that 
this collector acts as an "old style" mutable Collector, which opens the 
door to some optimizations in the concurrent implementation.  (Ignore it 
for now; it's purely an optimization.)  Spec-wise, things get more 
complicated because there are more things a Collector can do.  But 
again, that's mostly our problem.

What this means is that we can now (finally) define a Collector for 
reduction:

   Collector<T,T> reducing(BinaryOperator<T>)

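A minimal sketch of what such a factory could look like, assuming the 
same CollectorImpl helper that appears in toList() below, and using null 
as a "nothing accumulated yet" marker (that choice is mine; the real 
implementation may well differ):

     // Sketch only: reduce with op, treating null as "no element yet"
     public static <T> Collector<T, T> reducing(BinaryOperator<T> op) {
         BiFunction<T, T, T> accumulator =
             (acc, t) -> (acc == null) ? t : op.apply(acc, t);
         BinaryOperator<T> combiner =
             (left, right) -> (left == null)  ? right
                            : (right == null) ? left
                            : op.apply(left, right);
         return new CollectorImpl<>(() -> null, accumulator, combiner,
                                    false, false);
     }
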
This means that half the forms (reduce, map-reduce) of the grouping and 
partitioning combinators go away, folding instead into the 
"cascaded collector" form.  That leaves us with the following grouping 
forms:

   groupingBy(classifier)
   groupingBy(classifier, mapCtor)
   groupingBy(classifier, downstreamCollector)
   groupingBy(classifier, mapCtor, downstreamCollector)

along with groupingByConcurrent versions of each.  This is still a few 
forms, but it should be clear enough how they differ.
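
To show how the cascaded form composes, here is a rough sketch of 
groupingBy(classifier, downstream) against the detuned Collector (my 
sketch, assuming HashMap as the default map and the same CollectorImpl 
helper used in toList() below):

     // Sketch only: group into a HashMap, delegating per-key work to
     // the downstream collector's supplier/accumulator/combiner.
     public static <T, K, D> Collector<T, Map<K, D>>
     groupingBy(Function<? super T, ? extends K> classifier,
                Collector<T, D> downstream) {
         BiFunction<Map<K, D>, T, Map<K, D>> accumulator = (map, t) -> {
             K key = classifier.apply(t);
             D container = map.get(key);
             if (container == null)
                 container = downstream.resultSupplier().get();
             // the downstream accumulator may mutate the container or
             // return a replacement; store whatever it returns
             map.put(key, downstream.accumulator().apply(container, t));
             return map;
         };
         BinaryOperator<Map<K, D>> combiner = (left, right) -> {
             for (Map.Entry<K, D> e : right.entrySet())
                 left.merge(e.getKey(), e.getValue(), downstream.combiner());
             return left;
         };
         return new CollectorImpl<>(HashMap::new, accumulator, combiner,
                                    false, false);
     }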

The "max sale by salesman" example now becomes:

Map<Seller, Integer> maxBySeller =
   txns.collect(groupingBy(Txn::seller,
                           mapping(Txn::amount,
                                   reducing(Integer::max))));

From the previous version, the intermediate types 
GroupingCollector/PartitionCollector go away, as does the unfortunate 
type fudgery with the map constructors.  This is basically like the 
original version, but with half the groupingBy forms replaced with a 
single reducing() form.

The Collectors inventory now stands at:

  - toList()
  - toSet()
  - toCollection(ctor)
  - toStringBuilder()
  - toStringJoiner(sep)
  - to{Int,Long,Double}Statistics

  - toMap(mappingFn)
  - toMap(mappingFn, mapCtor, mergeFn)
  - toConcurrentMap(mappingFn)
  - toConcurrentMap(mappingFn, mapCtor, mergeFn)

  - mapping(fn, collector) // plus primitive specializations
  - reducing(BinaryOperator) // plus primitive specializations

  - groupingBy(classifier)
  - groupingBy(classifier, mapCtor)
  - groupingBy(classifier, downstreamCollector)
  - groupingBy(classifier, mapCtor, downstreamCollector)
  - groupingByConcurrent(classifier)
  - groupingByConcurrent(classifier, mapCtor)
  - groupingByConcurrent(classifier, downstreamCollector)
  - groupingByConcurrent(classifier, mapCtor, downstreamCollector)

  - partitioningBy(predicate)
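
As a quick illustration of how these compose, a usage sketch reusing the 
Txn example from above (the toSet() cascade and the result type are my 
reading of the signatures, not taken from this mail):

     // All sale amounts per seller, collected into sets, by cascading
     // groupingBy -> mapping -> toSet
     Map<Seller, Set<Integer>> amountsBySeller =
         txns.collect(groupingBy(Txn::seller,
                                 mapping(Txn::amount, toSet())));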

The more flexible Collector API gives us new opportunities, too.  For 
example, toList() used to use ArrayList exclusively.  But this version 
is more memory-efficient for small results:

     public static <T>
     Collector<T, List<T>> toList() {
         BiFunction<List<T>, T, List<T>> accumulator = (list, t) -> {
             int s = list.size();
             if (s == 0)
                 // first element: use an immutable singleton list
                 return Collections.singletonList(t);
             else if (s == 1) {
                 // second element: switch to a mutable ArrayList
                 List<T> newList = new ArrayList<>();
                 newList.add(list.get(0));
                 newList.add(t);
                 return newList;
             }
             else {
                 // already an ArrayList: mutate in place
                 list.add(t);
                 return list;
             }
         };
         BinaryOperator<List<T>> combiner = (left, right) -> {
             if (left.size() > 1) {
                 // left is already a mutable ArrayList
                 left.addAll(right);
                 return left;
             }
             else {
                 // left is the empty or singleton list:
                 // copy both sides into a fresh ArrayList
                 List<T> newList = new ArrayList<>(left.size() + right.size());
                 newList.addAll(left);
                 newList.addAll(right);
                 return newList;
             }
         };
         return new CollectorImpl<>(Collections::emptyList, accumulator,
                                    combiner, false, false);
     }


