Collectors update redux

Wed Feb 6 14:12:15 PST 2013

Did more tweaking with Collectors.

Recall there are two basic forms of the collect method:

The most basic one is the "on ramp", which doesn't require any 
understanding of Collector or the combinators therein; it is basically 
the mutable version of reduce.  It looks like:

   collect(() -> R, (R,T) -> void, (R,R) -> void)

The API shape is defined so that most invocations will work with method 
references:

   // To ArrayList
   collect(ArrayList::new, ArrayList::add, ArrayList::addAll)

Note that this works in parallel too; we create list at the leaves with 
::add, and merge them up the tree with ::addAll.

   // String concat
   collect(StringBuilder::new, StringBuilder::append,
           StringBuilder::append)

   // Turn an int stream to a BitSet with those bits set
   collect(BitSet::new, BitSet::set, BitSet::or)

   // String join with delimiter
   collect(() -> new StringJoiner(", "), StringJoiner::append,
           StringJoiner::append)

Again, all these work in parallel.

Digression: the various forms of reduce/etc form a ladder in terms of 
complexity:

   If you understand reduction, you can understand...
   ...reduce(T, BinaryOperator<T>)

   If you understand the above + Optional, you can then understand...
   ...reduce(BinaryOperator<T>)

   If you understand the above + "fold" (nonhomogeneous reduction), you 
can then understand...
     ...reduce(U, BiFunction<U, T, U> accumulator, BinaryOperator<U>);

   If you understand the above + "mutable fold" (inject), you can then 
understand...
     ...collect(Supplier<R>, (R,T) -> void, (R,R) -> void)

   If you understand the above + "Collector", you can then understand...
     ...collect(Collector)

This is all supported by the principle of commensurate effort; learn a 
little more, can do a little more.

OK, exiting digression, moving to the end of the list, those that use 
"canned" Collectors.

   collect(Collector)
   collectUnordered(Collector)

Collectors are basically a tuple of three lambdas and a boolean 
indicating whether the Collector can handle concurrent insertion:

   Collector<T,R> = { () -> R, (R,T) -> void, (R,R) -> R, isConcurrent }

Note there is a slight difference in the last argument, a 
BinaryOperator<R> rather than a BiConsumer<R,R>.  The BinaryOperator 
form is more flexible (it can support appending two Lists into a tree 
representation without copying the elements, whereas the (R,R) -> void 
form can't.)  This asymmetry is a rough edge, though in each case, the 
shape is "locally" optimal (in the three-arg version, the void form 
supports method refs better; in the Collector version, the result is 
more flexible, and that's where we need the flexibility.)  But we could 
make them consistent at the cost of the above uses becoming more like:

   collect(StringBuilder::new, StringBuilder::append,
           (l, r) -> { l.append(r); return l; })

Overall I think the current API yields better client code at the cost of 
this slightly rough edge.

The set of Collectors now includes:
   toCollection(Supplier<Collection>)
   toList()
   toSet()
   toStringBuilder()
   toStringJoiner(delimiter)

   // mapping combinators (plus primitive specializations)
   mapping(T->U, Collector<U>)

   // Single-level groupBy
   groupingBy(T->K)

   // groupBy with downstream Collector)
   groupingBy(T->K, Collector<T>)

   // grouping plus reduce
   groupingReduce(T->K, BinaryOperator<T>) // reduce only
   groupingReduce(T->K, T->U, BinaryOperator<U>) // map+reduce

   // join (nee mappedTo)
   joiningWith(T -> U) // produces Map<T,U>

   // partition
   partitioningBy(Predicate<T>)
   partitioningBy(Predicate<T>, Collector<T>)
   partitioningReduce(Predicate<T>, BinaryOperator<T>)
   partitioningReduce(Predicate<T>, T->U, BinaryOperator<T>)

   // statistics (gathers sum, count, min, max, average)
   toLongStatistics()
   toDoubleStatistics()

Plus, concurrent versions of most of these (which are suitable for 
unordered/contended/forEach-style execution.)  Plus versions that let 
you offer explicit constructors for maps and collections.  While these 
may seem like a lot, the implementations are highly compact -- all of 
these together, plus supporting machinery, fit in 500 LoC.

These Collectors are designed around composibility.  (It is vaguely 
frustrating that we even have to separate the "with downstream 
Collector" versions from the reducing versions.)  So they each have a 
form where you can do some level of categorization and then use a 
downstream collector to do further computation.  This is very powerful.

Examples, again using the familiar problem domain of transactions:

class Txn {
     Buyer buyer();
     Seller seller();
     String description();
     int amount();
}

Transactions by buyer:

   Map<Buyer, Collection<Txn>>
     m = txns.collect(groupingBy(Txn::buyer));

Highest-dollar transaction by buyer:
   Map<Buyer, Transaction>
     m = txns.collect(
         groupingReduce(Txn::buyer,
                        Comparators.greaterOf(
                            Comparators.comparing(Txn::amount)));

Here, comparing() takes the Txn -> amount function, and produces a 
Comparator<Txn>; greaterOf(comparator) turns that Comparator into a 
BinaryOperator that corresponds to "max by comparator".  We then reduce 
on that, yielding highest-dollar transaction per buyer.

Alternately, if you want the number, not the transaction:
   Map<Buyer, Integer>
     m = txns.collect(groupingReduce(Txn::buyer,
                                     Txn::amount, Integer::max));

Transactions by buyer, seller:
   Map<Buyer, Map<Seller, Collection<Txn>>>
     m = txns.collect(groupingBy(Txn::buyer, groupingBy(Txn::seller)));

Transaction volume statistics by buyer, seller:

   Map<Buyer, Map<Seller, LongStatistics>>
     m = txns.collect(groupingBy(Txn::buyer,
                          groupingBy(Txn::seller,
                                     mapping(Txn::amount,
                                             toLongStatistics())));

The statistics let you get at min, max, sum, count, and average from a 
single pass on the data (this trick taken from ParallelArray.)

We can mix and match at various levels.  For example:

Transactions by buyer, partitioned int "large/small" groups:

   Predicate<Txn> isLarge = t -> t.amount() > BIG;
   Map<Buyer, Map<Boolean, Collection<Transaction>>>
     m = txns.collect(groupingBy(Txn::buyer, partitioningBy(isLarge)));

Or, turning it around:

   Map<Boolean, Map<Buyer, Collection<Transaction>>>
     m = txns.collect(partitioningBy(isLarge, groupingBy(Txn::buyer)));

Because Collector is public, Kevin can write and publish 
Guava-multimap-bearing versions of these -- probably in about ten minutes.