Collectors update redux

Thu Feb 7 10:56:00 PST 2013

On Thu, Feb 7, 2013 at 10:41 AM, Tim Peierls <tim at peierls.net> wrote:

Is three-arg collect really the target "on ramp"?

IF you've been successfully spoon-fed the excellent examples (bitset etc.)
then you can see it as reasonably simple. Otherwise you're pretty lost in
the woods.

> I would have thought the first stop would be the combinators. OTOH ...
> there's a lot of stuff in there.

I think there is *way* too much stuff in there, and I don't have enough
time to even review it all before it gets set in stone. I strongly believe
we would be smarter to keep the set of prepackaged Collectors much smaller
and let third-party libraries experiment with which Collectors to provide.*
* And, no, it's not that I *want* more code that Guava will have to build
and maintain. It just seems far safer and more appropriate.  JDK only needs
the big ones -- a few versions of groupingBy, a few others, done.

It's harder to leave out Stream methods, but these are just static things
anyone could provide.

> I can think of uses for all of it, but I worry about someone faced with
> picking the right static factory method of Collectors. Maybe with the right
> class comment, users can be guided to the right combinator without having
> to know much.
>
> --tim
>
>
>
> On Wed, Feb 6, 2013 at 5:12 PM, Brian Goetz <brian.goetz at oracle.com>wrote:
>
>> Did more tweaking with Collectors.
>>
>> Recall there are two basic forms of the collect method:
>>
>> The most basic one is the "on ramp", which doesn't require any
>> understanding of Collector or the combinators therein; it is basically the
>> mutable version of reduce.  It looks like:
>>
>>   collect(() -> R, (R,T) -> void, (R,R) -> void)
>>
>> The API shape is defined so that most invocations will work with method
>> references:
>>
>>   // To ArrayList
>>   collect(ArrayList::new, ArrayList::add, ArrayList::addAll)
>>
>> Note that this works in parallel too; we create list at the leaves with
>> ::add, and merge them up the tree with ::addAll.
>>
>>   // String concat
>>   collect(StringBuilder::new, StringBuilder::append,
>>           StringBuilder::append)
>>
>>   // Turn an int stream to a BitSet with those bits set
>>   collect(BitSet::new, BitSet::set, BitSet::or)
>>
>>   // String join with delimiter
>>   collect(() -> new StringJoiner(", "), StringJoiner::append,
>>           StringJoiner::append)
>>
>> Again, all these work in parallel.
>>
>> Digression: the various forms of reduce/etc form a ladder in terms of
>> complexity:
>>
>>   If you understand reduction, you can understand...
>>   ...reduce(T, BinaryOperator<T>)
>>
>>   If you understand the above + Optional, you can then understand...
>>   ...reduce(BinaryOperator<T>)
>>
>>   If you understand the above + "fold" (nonhomogeneous reduction), you
>> can then understand...
>>     ...reduce(U, BiFunction<U, T, U> accumulator, BinaryOperator<U>);
>>
>>   If you understand the above + "mutable fold" (inject), you can then
>> understand...
>>     ...collect(Supplier<R>, (R,T) -> void, (R,R) -> void)
>>
>>   If you understand the above + "Collector", you can then understand...
>>     ...collect(Collector)
>>
>> This is all supported by the principle of commensurate effort; learn a
>> little more, can do a little more.
>>
>> OK, exiting digression, moving to the end of the list, those that use
>> "canned" Collectors.
>>
>>   collect(Collector)
>>   collectUnordered(Collector)
>>
>> Collectors are basically a tuple of three lambdas and a boolean
>> indicating whether the Collector can handle concurrent insertion:
>>
>>   Collector<T,R> = { () -> R, (R,T) -> void, (R,R) -> R, isConcurrent }
>>
>> Note there is a slight difference in the last argument, a
>> BinaryOperator<R> rather than a BiConsumer<R,R>.  The BinaryOperator form
>> is more flexible (it can support appending two Lists into a tree
>> representation without copying the elements, whereas the (R,R) -> void form
>> can't.)  This asymmetry is a rough edge, though in each case, the shape is
>> "locally" optimal (in the three-arg version, the void form supports method
>> refs better; in the Collector version, the result is more flexible, and
>> that's where we need the flexibility.)  But we could make them consistent
>> at the cost of the above uses becoming more like:
>>
>>   collect(StringBuilder::new, StringBuilder::append,
>>           (l, r) -> { l.append(r); return l; })
>>
>> Overall I think the current API yields better client code at the cost of
>> this slightly rough edge.
>>
>>
>> The set of Collectors now includes:
>>   toCollection(Supplier<**Collection>)
>>   toList()
>>   toSet()
>>   toStringBuilder()
>>   toStringJoiner(delimiter)
>>
>>   // mapping combinators (plus primitive specializations)
>>   mapping(T->U, Collector<U>)
>>
>>   // Single-level groupBy
>>   groupingBy(T->K)
>>
>>   // groupBy with downstream Collector)
>>   groupingBy(T->K, Collector<T>)
>>
>>   // grouping plus reduce
>>   groupingReduce(T->K, BinaryOperator<T>) // reduce only
>>   groupingReduce(T->K, T->U, BinaryOperator<U>) // map+reduce
>>
>>   // join (nee mappedTo)
>>   joiningWith(T -> U) // produces Map<T,U>
>>
>>   // partition
>>   partitioningBy(Predicate<T>)
>>   partitioningBy(Predicate<T>, Collector<T>)
>>   partitioningReduce(Predicate<**T>, BinaryOperator<T>)
>>   partitioningReduce(Predicate<**T>, T->U, BinaryOperator<T>)
>>
>>   // statistics (gathers sum, count, min, max, average)
>>   toLongStatistics()
>>   toDoubleStatistics()
>>
>> Plus, concurrent versions of most of these (which are suitable for
>> unordered/contended/forEach-**style execution.)  Plus versions that let
>> you offer explicit constructors for maps and collections.  While these may
>> seem like a lot, the implementations are highly compact -- all of these
>> together, plus supporting machinery, fit in 500 LoC.
>>
>> These Collectors are designed around composibility.  (It is vaguely
>> frustrating that we even have to separate the "with downstream Collector"
>> versions from the reducing versions.)  So they each have a form where you
>> can do some level of categorization and then use a downstream collector to
>> do further computation.  This is very powerful.
>>
>> Examples, again using the familiar problem domain of transactions:
>>
>> class Txn {
>>     Buyer buyer();
>>     Seller seller();
>>     String description();
>>     int amount();
>> }
>>
>> Transactions by buyer:
>>
>>   Map<Buyer, Collection<Txn>>
>>     m = txns.collect(groupingBy(Txn::**buyer));
>>
>> Highest-dollar transaction by buyer:
>>   Map<Buyer, Transaction>
>>     m = txns.collect(
>>         groupingReduce(Txn::buyer,
>>                        Comparators.greaterOf(
>>                            Comparators.comparing(Txn::**amount)));
>>
>> Here, comparing() takes the Txn -> amount function, and produces a
>> Comparator<Txn>; greaterOf(comparator) turns that Comparator into a
>> BinaryOperator that corresponds to "max by comparator".  We then reduce on
>> that, yielding highest-dollar transaction per buyer.
>>
>> Alternately, if you want the number, not the transaction:
>>   Map<Buyer, Integer>
>>     m = txns.collect(groupingReduce(**Txn::buyer,
>>                                     Txn::amount, Integer::max));
>>
>> Transactions by buyer, seller:
>>   Map<Buyer, Map<Seller, Collection<Txn>>>
>>     m = txns.collect(groupingBy(Txn::**buyer, groupingBy(Txn::seller)));
>>
>> Transaction volume statistics by buyer, seller:
>>
>>   Map<Buyer, Map<Seller, LongStatistics>>
>>     m = txns.collect(groupingBy(Txn::**buyer,
>>                          groupingBy(Txn::seller,
>>                                     mapping(Txn::amount,
>>                                             toLongStatistics())));
>>
>> The statistics let you get at min, max, sum, count, and average from a
>> single pass on the data (this trick taken from ParallelArray.)
>>
>> We can mix and match at various levels.  For example:
>>
>> Transactions by buyer, partitioned int "large/small" groups:
>>
>>   Predicate<Txn> isLarge = t -> t.amount() > BIG;
>>   Map<Buyer, Map<Boolean, Collection<Transaction>>>
>>     m = txns.collect(groupingBy(Txn::**buyer, partitioningBy(isLarge)));
>>
>> Or, turning it around:
>>
>>   Map<Boolean, Map<Buyer, Collection<Transaction>>>
>>     m = txns.collect(partitioningBy(**isLarge, groupingBy(Txn::buyer)));
>>
>> Because Collector is public, Kevin can write and publish
>> Guava-multimap-bearing versions of these -- probably in about ten minutes.
>>
>>
>

-- 
Kevin Bourrillion | Java Librarian | Google, Inc. | kevinb at google.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/lambda-libs-spec-experts/attachments/20130207/0faeaa38/attachment.html