Post-transform and the standard Collectors

Brian Goetz brian.goetz at oracle.com
Wed Jun 12 12:13:07 PDT 2013


A question this raises: it is now possible (wasn't before) for 
Collectors like minBy to return Optional, like their stream 
counterparts.  However, it is far less likely that such a Collector will 
be invoked on an empty stream than Stream.minBy() will.  Here's why:

If all you're doing is getting the minimum of a stream, you're more 
likely to do

   stream.minBy(c)

than

   stream.collect(Collectors.minBy(c))

The more common case where Collectors.minBy will be used is as the 
downstream collector of a groupingBy:

   Map<Person, Txn> largestTxnBySeller =
     txns.collect(groupingBy(Txn::seller, maxBy(comparing(Txn::amount))));

Here, we won't create a map key unless there is at least one value for 
it, so the downstream collector never sees an empty group.
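
A minimal sketch of that point, assuming the java.util.stream API as
released in Java 8 (where Collectors.maxBy collects to Optional). Txn
here is a hypothetical stand-in class, and the seller is a plain String
rather than a Person to keep the example short:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

final class LargestTxnBySeller {
    // Hypothetical transaction type, for illustration only.
    static final class Txn {
        final String seller;
        final int amount;
        Txn(String seller, int amount) { this.seller = seller; this.amount = amount; }
        String seller() { return seller; }
        int amount() { return amount; }
    }

    // Groups by seller and keeps the largest transaction in each group.
    static Map<String, Optional<Txn>> largestBySeller(List<Txn> txns) {
        return txns.stream().collect(
            Collectors.groupingBy(Txn::seller,
                Collectors.maxBy(Comparator.comparingInt(Txn::amount))));
    }

    public static void main(String[] args) {
        List<Txn> txns = Arrays.asList(
            new Txn("alice", 10), new Txn("bob", 25), new Txn("alice", 40));
        // A key exists only because some Txn mapped to it, so every
        // Optional in the result is present.
        largestBySeller(txns).forEach((seller, txn) ->
            System.out.println(seller + " -> " + txn.get().amount()));
    }
}
```

With an empty input the result is simply an empty map; no group ever
holds an empty Optional.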

So there are arguments both for and against having these collectors 
collect to Optional.  (If we don't, we should document the value 
associated with no results, which is almost certainly null for minBy, 
maxBy, and reducing(op)).
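
For illustration, here is one way a minBy collector could collect to
Optional using a post-transform. This is a sketch, not the actual
implementation; the Box class is a hypothetical helper, and the code
assumes the four-argument Collector.of factory from the released
Java 8 API:

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Collector;
import java.util.stream.Stream;

final class OptionalMinBy {
    // Hypothetical one-element mutable box; callers only ever see '?'.
    private static final class Box<T> {
        T value;
        boolean present;
        void accept(T t, Comparator<? super T> c) {
            if (!present || c.compare(t, value) < 0) {
                value = t;
                present = true;
            }
        }
    }

    static <T> Collector<T, ?, Optional<T>> minBy(Comparator<? super T> c) {
        return Collector.<T, Box<T>, Optional<T>>of(
            Box::new,                                   // fresh empty box
            (box, t) -> box.accept(t, c),               // keep the smaller element
            (a, b) -> { if (b.present) a.accept(b.value, c); return a; },
            // Post-transform: an untouched box becomes Optional.empty().
            // Optional.of rejects null, so a null minimum would throw here.
            box -> box.present ? Optional.of(box.value) : Optional.empty());
    }

    public static void main(String[] args) {
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        System.out.println(Stream.of("pear", "fig", "apple").collect(minBy(byLength))); // Optional[fig]
        System.out.println(Stream.<String>empty().collect(minBy(byLength)));            // Optional.empty
    }
}
```

The non-Optional alternative would use the same box but return
box.value from the finisher, which yields null for an empty input.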

On 6/12/2013 1:15 PM, Brian Goetz wrote:
> I've done a pass on the standard Collectors to adapt them to the
> post-transform.  Significant changes:
>
>   - All factory methods that returned Collector<T,R> now return
> Collector<T,?,R>.  (It is good that no factory method leaks its internal
> type.)  We can continue to discuss mitigation plans on this, if
> necessary, in a separate thread.
>
>   - The accumulator function in Collector is now back to a BiConsumer
> rather than a BiFunction.  This simplified a number of implementations.
> The STRICTLY_MUTATIVE characteristic goes away entirely.
>
>   - toList is now back to strict ArrayList, as Remi requested.
>
>   - toStringBuilder can now hide its StringBuilder, and collect to a
> String instead.  So I renamed it "concatenating" (and also extended it
> to collect CharSequence instead of String.)
>
>   - toStringJoiner can similarly hide the internal StringJoiner, so was
> renamed to "joining(delimiter)".  (Confusion with database joins is
> possible, open to a better name.)  Also on the to-do list: Paul
> suggested a way to support the full form of StringJoiner (with prefix
> and postfix) so I'll add an overload for that.
>
>   - The various reducing collectors can now use a mutable internal box
> class, and hide that as an implementation detail, eliminating the
> internal boxing in sumBy().
>
>   - It would be nice to overload sumBy(mapper) with int, long, and
> double versions, but unfortunately we have crossed the boundary of what
> type inference can disambiguate.  We have some choices here:
>     - Have a single sumBy(ToLongFunction<T>)
>     - Rename to summingXxx, allowing summingInt(ToIntFunction),
> summingLong(ToLongFunction), ...
>
>   - I want to add averaging() collectors (and now can), which would have
> to follow whatever naming choice we select above.
>
>   - Related, we have separately named toXxxSummaryStatistics which
> follow the same pattern.  If we go with summingInt/averagingInt, maybe
> this becomes summarizingInt?  We also have the opportunity now to make
> the resulting statistics immutable on completion -- do we want to do that?
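
For reference, a short sketch using the names that
java.util.stream.Collectors carries in the released Java 8 API
(joining, summingInt, averagingInt, summarizingInt). The snapshot
discussed here predates that release, so the names in the build under
review may differ:

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

final class NamedCollectorsDemo {
    // joining hides its internal builder and returns a String;
    // this overload also takes a prefix and a suffix.
    static String joined(List<String> words) {
        return words.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    // The primitive type is part of the name, so no overload resolution
    // between int, long and double mappers is needed.
    static int totalLength(List<String> words) {
        return words.stream().collect(Collectors.summingInt(String::length));
    }

    static double averageLength(List<String> words) {
        return words.stream().collect(Collectors.averagingInt(String::length));
    }

    static IntSummaryStatistics lengthStats(List<String> words) {
        return words.stream().collect(Collectors.summarizingInt(String::length));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc");
        System.out.println(joined(words));        // [a, bb, ccc]
        System.out.println(totalLength(words));   // 6
        System.out.println(averageLength(words)); // 2.0
        System.out.println(lengthStats(words).getMax()); // 3
    }
}
```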
>
> To put it all in one place, here are the advantages of this additional
> feature:
>
>   - It is the first thing that nearly every user asks for when they see
> Collector; its lack is a significant gap.  We had wanted this from the
> beginning; earlier versions of Collector made it impossible, but
> later evolutions made it possible again.
>   - It makes possible Collectors like averaging(), which people want and
> which were previously not practical.
>   - It enables Collectors to enforce invariants in the final result that
> cannot be enforced in the intermediate accumulation, such as tree
> balancing, immutability, etc.
>   - It enables Collectors like "toStringBuilder" to not leak their
> internal state (StringBuilder) into the user code, but instead provide
> the result type that the user actually wants (String).
>   - It eliminates the complexity of STRICTLY_MUTATIVE.
>   - It eliminates the performance overhead of boxing during reduction.
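
As an illustration of why averaging needs a post-transform, here is a
sketch of an averaging collector: the intermediate state is a running
sum and count, and only the finisher can turn that into the average.
The class and method names are hypothetical, the code assumes the
Collector.of factory from the released Java 8 API, and returning 0.0
for an empty input is an assumption of this sketch:

```java
import java.util.function.ToIntFunction;
import java.util.stream.Collector;
import java.util.stream.Stream;

final class AveragingSketch {
    // Accumulates into long[]{sum, count}; no per-element boxing occurs.
    static <T> Collector<T, ?, Double> averaging(ToIntFunction<? super T> mapper) {
        return Collector.<T, long[], Double>of(
            () -> new long[2],
            (acc, t) -> { acc[0] += mapper.applyAsInt(t); acc[1]++; },
            (a, b) -> { a[0] += b[0]; a[1] += b[1]; return a; },
            // Post-transform: sum and count become a single average.
            acc -> acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1]);
    }

    public static void main(String[] args) {
        double avg = Stream.of("a", "bb", "ccc").collect(averaging(String::length));
        System.out.println(avg); // 2.0
    }
}
```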
>
> In totality, I see these benefits as a huge step forward.  I realize
> there are some rough edges and we can continue to discuss how to file
> them down, or whether we wish to live with them.
>
> I'll be checking these into lambda shortly and posting a link to the
> docs for more detailed review.
>
> On 5/28/2013 6:23 PM, Brian Goetz wrote:
>> Adding the ability to have a post-transform function raises some
>> questions about how the standard collectors should change to
>> accommodate it.  These fall into two categories: "Should we?" and
>> "How?"
>>
>> For collectors like toStringBuilder, we can now collect to a String
>> and not expose the intermediate StringBuilder type.  This is both
>> closer to what the user wants and allows for better implementation
>> hiding:
>>
>> static Collector<String, ?, String> toStringBuilder() { ... }
>>
>> Of course, now the name is wrong.  So it would need a new name.
>> (Ditto for toStringJoiner.)
>>
>> It also makes sense to have a new combinator that can attach a
>> post-transform to an existing Collector (name is just a
>> placeholder):
>>
>> <T, I, R> Collector<T, ?, R> transforming(Function<I, R>,
>> Collector<T, ?, I>)
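
One possible shape for such a combinator, sketched against the
Collector interface as released in Java 8 (supplier, accumulator,
combiner, finisher). The name "transforming" is the placeholder from
the text; the extra type parameter A for the downstream's hidden
accumulation type is an assumption of this sketch:

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TransformingSketch {
    // Hypothetical combinator: attaches f as a post-transform to downstream.
    static <T, A, I, R> Collector<T, A, R> transforming(
            Function<I, R> f, Collector<T, A, I> downstream) {
        // Reuse the downstream's supplier, accumulator and combiner, and
        // run f after its finisher. No characteristics are carried over:
        // IDENTITY_FINISH would be wrong, and dropping the others is
        // merely conservative.
        return Collector.<T, A, R>of(
            downstream.supplier(),
            downstream.accumulator(),
            downstream.combiner(),
            downstream.finisher().andThen(f));
    }

    public static void main(String[] args) {
        List<String> names = Stream.of("a", "b")
            .collect(transforming(Collections::unmodifiableList, Collectors.toList()));
        System.out.println(names); // [a, b]
    }
}
```

The released Java 8 API offers a combinator of this kind as
Collectors.collectingAndThen(downstream, finisher).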
>>
>> A harder question is how much to introduce immutability.  For
>> example, one negative of the current toList() collector is that the
>> returned list is sometimes, but not always, immutable.  It would be
>> nice to be able to commit to something.  We could easily make it
>> immutable with a post-transform of Collections::unmodifiableList.  At
>> first, this seems a no-brainer.  But after more thought, it's
>> definitely a "should we?"
>>
>> Consider how this plays as a downstream collector.  The simplest form
>> of groupingBy -- groupingBy(f) -- expands to groupingBy(f, toList()).
>> If we made toList always return an immutable List, then we would have
>> to apply the post-transform to every value of the resulting map,
>> likely via a (sequential) Map.replaceAll on the simplest groupingBy
>> operation, even when the user didn't care about immutability.  Making
>> every groupingBy user pay for this seems like a lot.  (Alternately,
>> the default toList() could still return an immutable list, but the
>> default groupingBy could use a different downstream collector.)
>>
>> One option is to have mutable and immutable versions of every
>> Collection/Map-bearing Collector.  But this is a 2x explosion of
>> Collectors, after we did so much work to pare back the size of the
>> Collector set.   Another is to have combinators for adding
>> immutability to Collection, List, Set, and Map.   Then an immutable
>> groupingBy would be:
>>
>> collect(asImmutableMap(groupingBy(f, asImmutableList(toList()))));
>>
>> Wordy, but not terrible, and probably better than imposing the costs
>> on everyone?
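
A sketch of how such combinators might look. asImmutableList and
asImmutableMap are the hypothetical names from the expression above;
here they are built on Collectors.collectingAndThen and the
Collections.unmodifiable* wrappers from the released Java 8 API, which
is an assumption about the implementation, not a statement of it:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;

final class ImmutableGrouping {
    // Hypothetical combinator: wraps the collected List on completion.
    static <T, A, E> Collector<T, A, List<E>> asImmutableList(Collector<T, A, List<E>> c) {
        return Collectors.collectingAndThen(c, Collections::unmodifiableList);
    }

    // Hypothetical combinator: wraps the collected Map on completion.
    static <T, A, K, V> Collector<T, A, Map<K, V>> asImmutableMap(Collector<T, A, Map<K, V>> c) {
        return Collectors.collectingAndThen(c, Collections::unmodifiableMap);
    }

    // Only callers who ask for immutability pay for the extra wrapping.
    static Map<Integer, List<String>> byLength(List<String> words) {
        return words.stream().collect(
            asImmutableMap(Collectors.groupingBy(String::length,
                asImmutableList(Collectors.toList()))));
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> groups = byLength(Arrays.asList("a", "bb", "cc"));
        System.out.println(groups.get(2)); // [bb, cc]
    }
}
```

Both the map and each list in it reject modification with
UnsupportedOperationException, while a plain groupingBy(f) is left
untouched.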
>>
>>
>>

