Collectors inventory

Brian Goetz brian.goetz at oracle.com
Sun Mar 10 14:20:14 PDT 2013


OK, I've revamped Collectors in a way that may avoid the overload that 
Kevin, Remi, and Joe were concerned about.  At the same time, I've 
integrated concurrent collection into the model in a more obvious way.

The key problem is grouping-by.  There are essentially sixteen forms of 
groupingBy:
   { concurrent, not }
x { with explicit Map constructors, not }
x { simple group-by, cascaded group-by (downstream collector),
     simple reduce, map-reduce }

It's pretty hard to argue that any of these dimensions can be obviously 
jettisoned.  And simply pruning around the edges (e.g., "get rid of this 
variant") doesn't do the job.  Nor does "only provide the most general 
form", which guarantees that no one will be able to use it at all.

With the help of Don and his team last week, I came up with an alternate 
framing for groupingBy (and also partitioningBy, which has the same 
problems).  The key is to introduce an additional type, call it 
GroupingCollector, off of which we can hang some of the variants, and 
this lets us reduce the number of top-level collectors.

The current inventory, under this scheme (which I'll check in soon) is:

  - to{Collection,List,Set}
  - toString{Builder,Joiner}
  - to{Int,Long,Double}Statistics

  - toMap(mappingFn)               // was mappedTo
  - toMap(mappingFn, mapCtor)
  - toConcurrentMap(mappingFn)     // was ConcurrentCollectors.mappedTo
  - toConcurrentMap(mappingFn, mapCtor)

  - mapping(mappingFn, downstreamCollector) // plus primitive forms

  - groupingBy(classifierFn)
  - groupingBy(classifierFn, mapCtor)
  - groupingByConcurrent(classifierFn)
  - groupingByConcurrent(classifierFn, mapCtor)

  - partitioningBy(predicate)
  - partitioningByConcurrent(predicate)

This is a significant reduction in top-level forms -- we drop from 16 
groupingXxx forms to four, a similar reduction for partitioning forms, 
and -- most importantly -- ConcurrentCollectors *just goes away*.

The complexity moves into the return type of groupingBy.  Instead of 
returning a simple Collector, it returns a GroupingCollector.  In its 
current form, GroupingCollector implements Collector -- meaning you can 
use groupingBy(f) as a plain collector -- but the more advanced forms 
(cascading, reducing) hang off the GroupingCollector as extra methods.

For example:

// Simple form -- people by city
Map<City, Collection<Person>> m
     = people.stream().collect(groupingBy(Person::getCity));

// Two-level form -- people by state, city
//     Uses .then(otherCollector) method
Map<State, Map<City, Collection<Person>>> m
     = people.stream()
             .collect(groupingBy(Person::getState)
                      .then(groupingBy(Person::getCity)));

// Reducing form -- count of people by city
//     Uses .thenReducing(mapper, reducer) method
Map<City, Integer> m
     = people.stream()
             .collect(groupingBy(Person::getCity)
                      .thenReducing(p -> 1, Integer::sum));

The methods that appear on GroupingCollector are:
   .then(Collector downstream) -- cascaded groupBy
   .thenReducing(BinaryOperator<T>) -- reduce
   .thenReducing(Function<T,U>, BinaryOperator<U>) -- map/reduce

Partitioning is similar except the thenReducing methods need an identity 
argument too.
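
As a rough illustration of a partition reduction that takes an identity, here is a sketch written against java.util.stream.Collectors as it was later released in Java 8, not the draft API described in this message.  The class and method names (PartitionReduceSketch, lengthsByLong) are made up for the example.  In this sketch the identity is what the empty side of the partition ends up holding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PartitionReduceSketch {

    // Sums word lengths separately for words at least minLen long (key true)
    // and for shorter words (key false).  The identity 0 seeds both sides.
    static Map<Boolean, Integer> lengthsByLong(List<String> words, int minLen) {
        return words.stream().collect(
            Collectors.partitioningBy(
                (String w) -> w.length() >= minLen,
                Collectors.reducing(0, String::length, Integer::sum)));
    }

    public static void main(String[] args) {
        // No word reaches length 10, so the "true" side holds only the identity.
        System.out.println(lengthsByLong(Arrays.asList("a", "bb", "ccc"), 10));
    }
}
```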


public static interface GroupingCollector<T, K>
     extends Collector<T, Map<K, Collection<T>>> {

     <D> Collector<T, Map<K, D>> then(Collector<T, D> downstream);

     Collector<T, Map<K, T>> thenReducing(BinaryOperator<T> reducer);

     <U> Collector<T, Map<K, U>> thenReducing(Function<? super T, ? extends U> mapper,
                                              BinaryOperator<U> reducer);
}
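
To make the "collector that is also a factory" shape concrete, here is a minimal sketch of the idea.  It is not the draft implementation: it is written against the three-parameter Collector<T, A, R> of released Java 8, it groups into List rather than Collection, and the names GroupingSketch, Grouping, byInitial and countByInitial are hypothetical.  The reducing forms are assumed to be expressible with Collectors.toMap and a merge function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class GroupingSketch {

    // Hypothetical stand-in for the draft GroupingCollector.
    static final class Grouping<T, K>
            implements Collector<T, Map<K, List<T>>, Map<K, List<T>>> {

        private final Function<? super T, ? extends K> classifier;

        Grouping(Function<? super T, ? extends K> classifier) {
            this.classifier = classifier;
        }

        // --- Used directly, it is a plain "group into lists" collector ---
        @Override public Supplier<Map<K, List<T>>> supplier() {
            return HashMap::new;
        }

        @Override public BiConsumer<Map<K, List<T>>, T> accumulator() {
            return (m, t) ->
                m.computeIfAbsent(classifier.apply(t), k -> new ArrayList<>()).add(t);
        }

        @Override public BinaryOperator<Map<K, List<T>>> combiner() {
            return (left, right) -> {
                right.forEach((k, v) ->
                    left.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
                return left;
            };
        }

        @Override public Function<Map<K, List<T>>, Map<K, List<T>>> finisher() {
            return Function.identity();
        }

        @Override public Set<Collector.Characteristics> characteristics() {
            return Collections.unmodifiableSet(
                EnumSet.of(Collector.Characteristics.IDENTITY_FINISH));
        }

        // --- Used as a factory, it produces the cascaded and reducing forms ---
        <A, D> Collector<T, ?, Map<K, D>> then(Collector<? super T, A, D> downstream) {
            return Collectors.groupingBy(classifier, downstream);
        }

        Collector<T, ?, Map<K, T>> thenReducing(BinaryOperator<T> reducer) {
            return Collectors.toMap(classifier, Function.identity(), reducer);
        }

        <U> Collector<T, ?, Map<K, U>> thenReducing(
                Function<? super T, ? extends U> mapper, BinaryOperator<U> reducer) {
            return Collectors.toMap(classifier, mapper, reducer);
        }
    }

    // Simple form: words grouped by first character.
    static Map<Character, List<String>> byInitial(List<String> words) {
        Function<String, Character> initial = w -> w.charAt(0);
        return words.stream().collect(new Grouping<String, Character>(initial));
    }

    // Map/reduce form: count of words per first character.
    static Map<Character, Integer> countByInitial(List<String> words) {
        Function<String, Character> initial = w -> w.charAt(0);
        return words.stream().collect(
            new Grouping<String, Character>(initial).thenReducing(w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana", "cherry");
        System.out.println(byInitial(words));
        System.out.println(countByInitial(words));
    }
}
```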

The slightly weird thing about this is that a GroupingCollector is both 
a Collector (for the simple form) and a factory for collectors (for the 
cascaded forms).  This makes the user code better (a simple group by is 
just collect(groupingBy(f))), but makes the type harder to 
understand.  We can adjust this tradeoff by severing the "extends 
Collector" and adding another method for "get me a simple collector", 
but I'm not sure this is an improvement.  This would probably look like:

   groupingBy(fn).toList()

or some such.
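
A sketch of what that severed, factory-only shape might look like, again written against released Java 8 Collectors rather than the draft API.  The names GroupingFactorySketch, Grouping, byLength and countByLength are hypothetical; the point is only that Grouping no longer implements Collector, so the simple form has to go through toList().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class GroupingFactorySketch {

    // Hypothetical factory-only type: not a Collector itself.
    static final class Grouping<T, K> {
        private final Function<? super T, ? extends K> classifier;

        Grouping(Function<? super T, ? extends K> classifier) {
            this.classifier = classifier;
        }

        // "Get me a simple collector": group elements into lists.
        Collector<T, ?, Map<K, List<T>>> toList() {
            return Collectors.groupingBy(classifier);
        }

        // Cascaded form: hand each group to a downstream collector.
        <A, D> Collector<T, ?, Map<K, D>> then(Collector<? super T, A, D> downstream) {
            return Collectors.groupingBy(classifier, downstream);
        }
    }

    static <T, K> Grouping<T, K> groupingBy(Function<? super T, ? extends K> classifier) {
        return new Grouping<>(classifier);
    }

    static Map<Integer, List<String>> byLength(List<String> words) {
        Grouping<String, Integer> g = groupingBy(String::length);
        return words.stream().collect(g.toList());
    }

    static Map<Integer, Long> countByLength(List<String> words) {
        Grouping<String, Integer> g = groupingBy(String::length);
        return words.stream().collect(g.then(Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "cc");
        System.out.println(byLength(words));
        System.out.println(countByLength(words));
    }
}
```

The user code gets one call longer for the simple case, but the type has a single role.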

One variant we did jettison is the one where you provide an explicit 
Collection ctor, so you could group into a Set<Person> instead of a 
List<Person>.  (You can still get this with 
groupingBy(f).then(toCollection(ctor)).)  If we did the above 
transformation, this could come back as:

   groupingBy(fn).toCollection(ctor)

or some such.
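
For reference, grouping into an explicit collection type via a downstream toCollection collector can be sketched with the Collectors API as later released in Java 8 (where the downstream is passed as a second argument to groupingBy rather than through .then).  GroupIntoSetSketch and byLength are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class GroupIntoSetSketch {

    // Groups words by length, collecting each group into a sorted set
    // instead of the default list.
    static Map<Integer, Set<String>> byLength(List<String> words) {
        Collector<String, ?, Set<String>> intoSortedSet =
            Collectors.toCollection(TreeSet::new);
        return words.stream()
                    .collect(Collectors.groupingBy(String::length, intoSortedSet));
    }

    public static void main(String[] args) {
        System.out.println(byLength(Arrays.asList("bb", "a", "aa", "bb")));
    }
}
```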

Overall this seems a much more approachable set of Collectors.  Still a 
few fine details to work out, including:
  - Does "GroupingCollector extends Collector" simplify or complicate?
  - Naming of everything
  - Do we want to add back the "grouping to explicit collection" form?




More information about the lambda-libs-spec-experts mailing list