Collectors update redux

Thu Feb 7 10:41:10 PST 2013

Is three-arg collect really the target "on ramp"? I would have thought the
first stop would be the combinators. OTOH ... there's a lot of stuff in
there. I can think of uses for all of it, but I worry about someone faced
with picking the right static factory method of Collectors. Maybe with the
right class comment, users can be guided to the right combinator without
having to know much.

--tim

On Wed, Feb 6, 2013 at 5:12 PM, Brian Goetz <brian.goetz at oracle.com> wrote:

> Did more tweaking with Collectors.
>
> Recall there are two basic forms of the collect method:
>
> The most basic one is the "on ramp", which doesn't require any
> understanding of Collector or the combinators therein; it is basically the
> mutable version of reduce.  It looks like:
>
>   collect(() -> R, (R,T) -> void, (R,R) -> void)
>
> The API shape is defined so that most invocations will work with method
> references:
>
>   // To ArrayList
>   collect(ArrayList::new, ArrayList::add, ArrayList::addAll)
>
> Note that this works in parallel too; we create list at the leaves with
> ::add, and merge them up the tree with ::addAll.
>
>   // String concat
>   collect(StringBuilder::new, StringBuilder::append,
>           StringBuilder::append)
>
>   // Turn an int stream to a BitSet with those bits set
>   collect(BitSet::new, BitSet::set, BitSet::or)
>
>   // String join with delimiter
>   collect(() -> new StringJoiner(", "), StringJoiner::append,
>           StringJoiner::append)
>
> Again, all these work in parallel.
>
> Digression: the various forms of reduce/etc form a ladder in terms of
> complexity:
>
>   If you understand reduction, you can understand...
>   ...reduce(T, BinaryOperator<T>)
>
>   If you understand the above + Optional, you can then understand...
>   ...reduce(BinaryOperator<T>)
>
>   If you understand the above + "fold" (nonhomogeneous reduction), you can
> then understand...
>     ...reduce(U, BiFunction<U, T, U> accumulator, BinaryOperator<U>);
>
>   If you understand the above + "mutable fold" (inject), you can then
> understand...
>     ...collect(Supplier<R>, (R,T) -> void, (R,R) -> void)
>
>   If you understand the above + "Collector", you can then understand...
>     ...collect(Collector)
>
> This is all supported by the principle of commensurate effort; learn a
> little more, can do a little more.
>
> OK, exiting digression, moving to the end of the list, those that use
> "canned" Collectors.
>
>   collect(Collector)
>   collectUnordered(Collector)
>
> Collectors are basically a tuple of three lambdas and a boolean indicating
> whether the Collector can handle concurrent insertion:
>
>   Collector<T,R> = { () -> R, (R,T) -> void, (R,R) -> R, isConcurrent }
>
> Note there is a slight difference in the last argument, a
> BinaryOperator<R> rather than a BiConsumer<R,R>.  The BinaryOperator form
> is more flexible (it can support appending two Lists into a tree
> representation without copying the elements, whereas the (R,R) -> void form
> can't.)  This asymmetry is a rough edge, though in each case, the shape is
> "locally" optimal (in the three-arg version, the void form supports method
> refs better; in the Collector version, the result is more flexible, and
> that's where we need the flexibility.)  But we could make them consistent
> at the cost of the above uses becoming more like:
>
>   collect(StringBuilder::new, StringBuilder::append,
>           (l, r) -> { l.append(r); return l; })
>
> Overall I think the current API yields better client code at the cost of
> this slightly rough edge.
>
>
> The set of Collectors now includes:
>   toCollection(Supplier<**Collection>)
>   toList()
>   toSet()
>   toStringBuilder()
>   toStringJoiner(delimiter)
>
>   // mapping combinators (plus primitive specializations)
>   mapping(T->U, Collector<U>)
>
>   // Single-level groupBy
>   groupingBy(T->K)
>
>   // groupBy with downstream Collector)
>   groupingBy(T->K, Collector<T>)
>
>   // grouping plus reduce
>   groupingReduce(T->K, BinaryOperator<T>) // reduce only
>   groupingReduce(T->K, T->U, BinaryOperator<U>) // map+reduce
>
>   // join (nee mappedTo)
>   joiningWith(T -> U) // produces Map<T,U>
>
>   // partition
>   partitioningBy(Predicate<T>)
>   partitioningBy(Predicate<T>, Collector<T>)
>   partitioningReduce(Predicate<**T>, BinaryOperator<T>)
>   partitioningReduce(Predicate<**T>, T->U, BinaryOperator<T>)
>
>   // statistics (gathers sum, count, min, max, average)
>   toLongStatistics()
>   toDoubleStatistics()
>
> Plus, concurrent versions of most of these (which are suitable for
> unordered/contended/forEach-**style execution.)  Plus versions that let
> you offer explicit constructors for maps and collections.  While these may
> seem like a lot, the implementations are highly compact -- all of these
> together, plus supporting machinery, fit in 500 LoC.
>
> These Collectors are designed around composibility.  (It is vaguely
> frustrating that we even have to separate the "with downstream Collector"
> versions from the reducing versions.)  So they each have a form where you
> can do some level of categorization and then use a downstream collector to
> do further computation.  This is very powerful.
>
> Examples, again using the familiar problem domain of transactions:
>
> class Txn {
>     Buyer buyer();
>     Seller seller();
>     String description();
>     int amount();
> }
>
> Transactions by buyer:
>
>   Map<Buyer, Collection<Txn>>
>     m = txns.collect(groupingBy(Txn::**buyer));
>
> Highest-dollar transaction by buyer:
>   Map<Buyer, Transaction>
>     m = txns.collect(
>         groupingReduce(Txn::buyer,
>                        Comparators.greaterOf(
>                            Comparators.comparing(Txn::**amount)));
>
> Here, comparing() takes the Txn -> amount function, and produces a
> Comparator<Txn>; greaterOf(comparator) turns that Comparator into a
> BinaryOperator that corresponds to "max by comparator".  We then reduce on
> that, yielding highest-dollar transaction per buyer.
>
> Alternately, if you want the number, not the transaction:
>   Map<Buyer, Integer>
>     m = txns.collect(groupingReduce(**Txn::buyer,
>                                     Txn::amount, Integer::max));
>
> Transactions by buyer, seller:
>   Map<Buyer, Map<Seller, Collection<Txn>>>
>     m = txns.collect(groupingBy(Txn::**buyer, groupingBy(Txn::seller)));
>
> Transaction volume statistics by buyer, seller:
>
>   Map<Buyer, Map<Seller, LongStatistics>>
>     m = txns.collect(groupingBy(Txn::**buyer,
>                          groupingBy(Txn::seller,
>                                     mapping(Txn::amount,
>                                             toLongStatistics())));
>
> The statistics let you get at min, max, sum, count, and average from a
> single pass on the data (this trick taken from ParallelArray.)
>
> We can mix and match at various levels.  For example:
>
> Transactions by buyer, partitioned int "large/small" groups:
>
>   Predicate<Txn> isLarge = t -> t.amount() > BIG;
>   Map<Buyer, Map<Boolean, Collection<Transaction>>>
>     m = txns.collect(groupingBy(Txn::**buyer, partitioningBy(isLarge)));
>
> Or, turning it around:
>
>   Map<Boolean, Map<Buyer, Collection<Transaction>>>
>     m = txns.collect(partitioningBy(**isLarge, groupingBy(Txn::buyer)));
>
> Because Collector is public, Kevin can write and publish
> Guava-multimap-bearing versions of these -- probably in about ten minutes.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/lambda-libs-spec-experts/attachments/20130207/f1eb8a4b/attachment-0001.html