Collectors update redux
Brian Goetz
brian.goetz at oracle.com
Wed Feb 6 14:12:15 PST 2013
Did more tweaking with Collectors.
Recall there are two basic forms of the collect method:
The most basic one is the "on ramp", which doesn't require any
understanding of Collector or the combinators therein; it is basically
the mutable version of reduce. It looks like:
collect(() -> R, (R,T) -> void, (R,R) -> void)
The API shape is defined so that most invocations will work with method
references:
// To ArrayList
collect(ArrayList::new, ArrayList::add, ArrayList::addAll)
Note that this works in parallel too; we create list at the leaves with
::add, and merge them up the tree with ::addAll.
// String concat
collect(StringBuilder::new, StringBuilder::append,
StringBuilder::append)
// Turn an int stream to a BitSet with those bits set
collect(BitSet::new, BitSet::set, BitSet::or)
// String join with delimiter
collect(() -> new StringJoiner(", "), StringJoiner::append,
StringJoiner::append)
Again, all these work in parallel.
Digression: the various forms of reduce/etc form a ladder in terms of
complexity:
If you understand reduction, you can understand...
...reduce(T, BinaryOperator<T>)
If you understand the above + Optional, you can then understand...
...reduce(BinaryOperator<T>)
If you understand the above + "fold" (nonhomogeneous reduction), you
can then understand...
...reduce(U, BiFunction<U, T, U> accumulator, BinaryOperator<U>);
If you understand the above + "mutable fold" (inject), you can then
understand...
...collect(Supplier<R>, (R,T) -> void, (R,R) -> void)
If you understand the above + "Collector", you can then understand...
...collect(Collector)
This is all supported by the principle of commensurate effort; learn a
little more, can do a little more.
OK, exiting digression, moving to the end of the list, those that use
"canned" Collectors.
collect(Collector)
collectUnordered(Collector)
Collectors are basically a tuple of three lambdas and a boolean
indicating whether the Collector can handle concurrent insertion:
Collector<T,R> = { () -> R, (R,T) -> void, (R,R) -> R, isConcurrent }
Note there is a slight difference in the last argument, a
BinaryOperator<R> rather than a BiConsumer<R,R>. The BinaryOperator
form is more flexible (it can support appending two Lists into a tree
representation without copying the elements, whereas the (R,R) -> void
form can't.) This asymmetry is a rough edge, though in each case, the
shape is "locally" optimal (in the three-arg version, the void form
supports method refs better; in the Collector version, the result is
more flexible, and that's where we need the flexibility.) But we could
make them consistent at the cost of the above uses becoming more like:
collect(StringBuilder::new, StringBuilder::append,
(l, r) -> { l.append(r); return l; })
Overall I think the current API yields better client code at the cost of
this slightly rough edge.
The set of Collectors now includes:
toCollection(Supplier<Collection>)
toList()
toSet()
toStringBuilder()
toStringJoiner(delimiter)
// mapping combinators (plus primitive specializations)
mapping(T->U, Collector<U>)
// Single-level groupBy
groupingBy(T->K)
// groupBy with downstream Collector)
groupingBy(T->K, Collector<T>)
// grouping plus reduce
groupingReduce(T->K, BinaryOperator<T>) // reduce only
groupingReduce(T->K, T->U, BinaryOperator<U>) // map+reduce
// join (nee mappedTo)
joiningWith(T -> U) // produces Map<T,U>
// partition
partitioningBy(Predicate<T>)
partitioningBy(Predicate<T>, Collector<T>)
partitioningReduce(Predicate<T>, BinaryOperator<T>)
partitioningReduce(Predicate<T>, T->U, BinaryOperator<T>)
// statistics (gathers sum, count, min, max, average)
toLongStatistics()
toDoubleStatistics()
Plus, concurrent versions of most of these (which are suitable for
unordered/contended/forEach-style execution.) Plus versions that let
you offer explicit constructors for maps and collections. While these
may seem like a lot, the implementations are highly compact -- all of
these together, plus supporting machinery, fit in 500 LoC.
These Collectors are designed around composibility. (It is vaguely
frustrating that we even have to separate the "with downstream
Collector" versions from the reducing versions.) So they each have a
form where you can do some level of categorization and then use a
downstream collector to do further computation. This is very powerful.
Examples, again using the familiar problem domain of transactions:
class Txn {
Buyer buyer();
Seller seller();
String description();
int amount();
}
Transactions by buyer:
Map<Buyer, Collection<Txn>>
m = txns.collect(groupingBy(Txn::buyer));
Highest-dollar transaction by buyer:
Map<Buyer, Transaction>
m = txns.collect(
groupingReduce(Txn::buyer,
Comparators.greaterOf(
Comparators.comparing(Txn::amount)));
Here, comparing() takes the Txn -> amount function, and produces a
Comparator<Txn>; greaterOf(comparator) turns that Comparator into a
BinaryOperator that corresponds to "max by comparator". We then reduce
on that, yielding highest-dollar transaction per buyer.
Alternately, if you want the number, not the transaction:
Map<Buyer, Integer>
m = txns.collect(groupingReduce(Txn::buyer,
Txn::amount, Integer::max));
Transactions by buyer, seller:
Map<Buyer, Map<Seller, Collection<Txn>>>
m = txns.collect(groupingBy(Txn::buyer, groupingBy(Txn::seller)));
Transaction volume statistics by buyer, seller:
Map<Buyer, Map<Seller, LongStatistics>>
m = txns.collect(groupingBy(Txn::buyer,
groupingBy(Txn::seller,
mapping(Txn::amount,
toLongStatistics())));
The statistics let you get at min, max, sum, count, and average from a
single pass on the data (this trick taken from ParallelArray.)
We can mix and match at various levels. For example:
Transactions by buyer, partitioned int "large/small" groups:
Predicate<Txn> isLarge = t -> t.amount() > BIG;
Map<Buyer, Map<Boolean, Collection<Transaction>>>
m = txns.collect(groupingBy(Txn::buyer, partitioningBy(isLarge)));
Or, turning it around:
Map<Boolean, Map<Buyer, Collection<Transaction>>>
m = txns.collect(partitioningBy(isLarge, groupingBy(Txn::buyer)));
Because Collector is public, Kevin can write and publish
Guava-multimap-bearing versions of these -- probably in about ten minutes.
More information about the lambda-libs-spec-observers
mailing list