Collectors update redux
Kevin Bourrillion
kevinb at google.com
Thu Feb 7 10:56:00 PST 2013
On Thu, Feb 7, 2013 at 10:41 AM, Tim Peierls <tim at peierls.net> wrote:
Is three-arg collect really the target "on ramp"?
IF you've been successfully spoon-fed the excellent examples (bitset etc.)
then you can see it as reasonably simple. Otherwise you're pretty lost in
the woods.
> I would have thought the first stop would be the combinators. OTOH ...
> there's a lot of stuff in there.
I think there is *way* too much stuff in there, and I don't have enough
time to even review it all before it gets set in stone. I strongly believe
we would be smarter to keep the set of prepackaged Collectors much smaller
and let third-party libraries experiment with which Collectors to provide.*
* And, no, it's not that I *want* more code that Guava will have to build
and maintain. It just seems far safer and more appropriate. JDK only needs
the big ones -- a few versions of groupingBy, a few others, done.
It's harder to leave out Stream methods, but these are just static things
anyone could provide.
> I can think of uses for all of it, but I worry about someone faced with
> picking the right static factory method of Collectors. Maybe with the right
> class comment, users can be guided to the right combinator without having
> to know much.
>
> --tim
>
>
>
> On Wed, Feb 6, 2013 at 5:12 PM, Brian Goetz <brian.goetz at oracle.com>wrote:
>
>> Did more tweaking with Collectors.
>>
>> Recall there are two basic forms of the collect method:
>>
>> The most basic one is the "on ramp", which doesn't require any
>> understanding of Collector or the combinators therein; it is basically the
>> mutable version of reduce. It looks like:
>>
>> collect(() -> R, (R,T) -> void, (R,R) -> void)
>>
>> The API shape is defined so that most invocations will work with method
>> references:
>>
>> // To ArrayList
>> collect(ArrayList::new, ArrayList::add, ArrayList::addAll)
>>
>> Note that this works in parallel too; we create list at the leaves with
>> ::add, and merge them up the tree with ::addAll.
>>
>> // String concat
>> collect(StringBuilder::new, StringBuilder::append,
>> StringBuilder::append)
>>
>> // Turn an int stream to a BitSet with those bits set
>> collect(BitSet::new, BitSet::set, BitSet::or)
>>
>> // String join with delimiter
>> collect(() -> new StringJoiner(", "), StringJoiner::append,
>> StringJoiner::append)
>>
>> Again, all these work in parallel.
>>
>> Digression: the various forms of reduce/etc form a ladder in terms of
>> complexity:
>>
>> If you understand reduction, you can understand...
>> ...reduce(T, BinaryOperator<T>)
>>
>> If you understand the above + Optional, you can then understand...
>> ...reduce(BinaryOperator<T>)
>>
>> If you understand the above + "fold" (nonhomogeneous reduction), you
>> can then understand...
>> ...reduce(U, BiFunction<U, T, U> accumulator, BinaryOperator<U>);
>>
>> If you understand the above + "mutable fold" (inject), you can then
>> understand...
>> ...collect(Supplier<R>, (R,T) -> void, (R,R) -> void)
>>
>> If you understand the above + "Collector", you can then understand...
>> ...collect(Collector)
>>
>> This is all supported by the principle of commensurate effort; learn a
>> little more, can do a little more.
>>
>> OK, exiting digression, moving to the end of the list, those that use
>> "canned" Collectors.
>>
>> collect(Collector)
>> collectUnordered(Collector)
>>
>> Collectors are basically a tuple of three lambdas and a boolean
>> indicating whether the Collector can handle concurrent insertion:
>>
>> Collector<T,R> = { () -> R, (R,T) -> void, (R,R) -> R, isConcurrent }
>>
>> Note there is a slight difference in the last argument, a
>> BinaryOperator<R> rather than a BiConsumer<R,R>. The BinaryOperator form
>> is more flexible (it can support appending two Lists into a tree
>> representation without copying the elements, whereas the (R,R) -> void form
>> can't.) This asymmetry is a rough edge, though in each case, the shape is
>> "locally" optimal (in the three-arg version, the void form supports method
>> refs better; in the Collector version, the result is more flexible, and
>> that's where we need the flexibility.) But we could make them consistent
>> at the cost of the above uses becoming more like:
>>
>> collect(StringBuilder::new, StringBuilder::append,
>> (l, r) -> { l.append(r); return l; })
>>
>> Overall I think the current API yields better client code at the cost of
>> this slightly rough edge.
>>
>>
>> The set of Collectors now includes:
>> toCollection(Supplier<**Collection>)
>> toList()
>> toSet()
>> toStringBuilder()
>> toStringJoiner(delimiter)
>>
>> // mapping combinators (plus primitive specializations)
>> mapping(T->U, Collector<U>)
>>
>> // Single-level groupBy
>> groupingBy(T->K)
>>
>> // groupBy with downstream Collector)
>> groupingBy(T->K, Collector<T>)
>>
>> // grouping plus reduce
>> groupingReduce(T->K, BinaryOperator<T>) // reduce only
>> groupingReduce(T->K, T->U, BinaryOperator<U>) // map+reduce
>>
>> // join (nee mappedTo)
>> joiningWith(T -> U) // produces Map<T,U>
>>
>> // partition
>> partitioningBy(Predicate<T>)
>> partitioningBy(Predicate<T>, Collector<T>)
>> partitioningReduce(Predicate<**T>, BinaryOperator<T>)
>> partitioningReduce(Predicate<**T>, T->U, BinaryOperator<T>)
>>
>> // statistics (gathers sum, count, min, max, average)
>> toLongStatistics()
>> toDoubleStatistics()
>>
>> Plus, concurrent versions of most of these (which are suitable for
>> unordered/contended/forEach-**style execution.) Plus versions that let
>> you offer explicit constructors for maps and collections. While these may
>> seem like a lot, the implementations are highly compact -- all of these
>> together, plus supporting machinery, fit in 500 LoC.
>>
>> These Collectors are designed around composibility. (It is vaguely
>> frustrating that we even have to separate the "with downstream Collector"
>> versions from the reducing versions.) So they each have a form where you
>> can do some level of categorization and then use a downstream collector to
>> do further computation. This is very powerful.
>>
>> Examples, again using the familiar problem domain of transactions:
>>
>> class Txn {
>> Buyer buyer();
>> Seller seller();
>> String description();
>> int amount();
>> }
>>
>> Transactions by buyer:
>>
>> Map<Buyer, Collection<Txn>>
>> m = txns.collect(groupingBy(Txn::**buyer));
>>
>> Highest-dollar transaction by buyer:
>> Map<Buyer, Transaction>
>> m = txns.collect(
>> groupingReduce(Txn::buyer,
>> Comparators.greaterOf(
>> Comparators.comparing(Txn::**amount)));
>>
>> Here, comparing() takes the Txn -> amount function, and produces a
>> Comparator<Txn>; greaterOf(comparator) turns that Comparator into a
>> BinaryOperator that corresponds to "max by comparator". We then reduce on
>> that, yielding highest-dollar transaction per buyer.
>>
>> Alternately, if you want the number, not the transaction:
>> Map<Buyer, Integer>
>> m = txns.collect(groupingReduce(**Txn::buyer,
>> Txn::amount, Integer::max));
>>
>> Transactions by buyer, seller:
>> Map<Buyer, Map<Seller, Collection<Txn>>>
>> m = txns.collect(groupingBy(Txn::**buyer, groupingBy(Txn::seller)));
>>
>> Transaction volume statistics by buyer, seller:
>>
>> Map<Buyer, Map<Seller, LongStatistics>>
>> m = txns.collect(groupingBy(Txn::**buyer,
>> groupingBy(Txn::seller,
>> mapping(Txn::amount,
>> toLongStatistics())));
>>
>> The statistics let you get at min, max, sum, count, and average from a
>> single pass on the data (this trick taken from ParallelArray.)
>>
>> We can mix and match at various levels. For example:
>>
>> Transactions by buyer, partitioned int "large/small" groups:
>>
>> Predicate<Txn> isLarge = t -> t.amount() > BIG;
>> Map<Buyer, Map<Boolean, Collection<Transaction>>>
>> m = txns.collect(groupingBy(Txn::**buyer, partitioningBy(isLarge)));
>>
>> Or, turning it around:
>>
>> Map<Boolean, Map<Buyer, Collection<Transaction>>>
>> m = txns.collect(partitioningBy(**isLarge, groupingBy(Txn::buyer)));
>>
>> Because Collector is public, Kevin can write and publish
>> Guava-multimap-bearing versions of these -- probably in about ten minutes.
>>
>>
>
--
Kevin Bourrillion | Java Librarian | Google, Inc. | kevinb at google.com
More information about the lambda-libs-spec-observers
mailing list