Post-transform and the standard Collectors

Wed Jun 12 13:59:25 PDT 2013

I've posted a doc snapshot here:
   http://cr.openjdk.java.net/~briangoetz/doctmp/doc/

As to the ? issue: looking at declarations like:

static <T,K,D,A,M extends java.util.Map<K,D>>
Collector<T,?,M> groupingBy(...)

there's enough generics noise there that the additional question mark 
seems not the worst problem...
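
To make the ? concrete, here is a minimal sketch of where it shows up.
It assumes the Collector shape that later shipped in java.util.stream
(three type parameters, with groupingBy hiding its accumulation type);
WildcardDemo, byLength, and group are hypothetical names used only for
illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class WildcardDemo {
    // Hypothetical helper. The '?' has to be written only when a
    // Collector is stored in a variable or returned from a method.
    static Collector<String, ?, Map<Integer, List<String>>> byLength() {
        return Collectors.groupingBy(String::length);
    }

    // At the call site the intermediate type is never spelled out.
    static Map<Integer, List<String>> group(List<String> words) {
        return words.stream().collect(byLength());
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("a", "bb", "cc", "ddd")));
    }
}
```

Code that passes a collector straight to collect() never writes the
wildcard at all, which is the usual case.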

On 6/12/2013 3:13 PM, Brian Goetz wrote:
> A question this raises: it is now possible (wasn't before) for
> Collectors like minBy to return Optional, like their stream
> counterparts.  However, it is far less likely that such a Collector will
> be invoked on an empty stream than Stream.minBy() will.  Here's why:
>
> If all you're doing is getting the minima of a stream, you're more
> likely to do
>
>    stream.minBy(c)
>
> than
>
>    stream.collect(Collectors.minBy(c))
>
> The more common case where Collectors.minBy will be used is as the
> downstream of a groupingBy:
>
>    Map<Person, Txn> largestTxnBySeller =
>      txns.collect(groupingBy(Txn::seller, maxBy(comparing(Txn::amount))));
>
> Here, we won't create a map key unless there is already at least one
> value for it, so the downstream collector never sees an empty input.
>
> So there are arguments both for and against having these collectors
> collect to Optional.  (If we don't, we should document the value
> associated with no results, which is almost certainly null for minBy,
> maxBy, and reducing(op)).
>
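The two situations described above can be sketched as follows.  This
assumes the Optional-returning form of maxBy (the form that eventually
shipped in java.util.stream.Collectors); the Txn class here is a
hypothetical stand-in, keyed by a String seller rather than a Person.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MaxByDemo {
    // Hypothetical stand-in for the Txn type mentioned in the message.
    static final class Txn {
        final String seller;
        final int amount;
        Txn(String seller, int amount) { this.seller = seller; this.amount = amount; }
        String seller() { return seller; }
        int amount() { return amount; }
    }

    // Downstream use: a key exists only if some Txn mapped to it,
    // so every Optional in the resulting map is present.
    static Map<String, Optional<Txn>> largestBySeller(List<Txn> txns) {
        return txns.stream().collect(
            Collectors.groupingBy(Txn::seller,
                Collectors.maxBy(Comparator.comparingInt(Txn::amount))));
    }

    // Standalone use on an empty stream: the only way to get "no result".
    static Optional<Txn> largestOfNone() {
        return Stream.<Txn>empty()
            .collect(Collectors.maxBy(Comparator.comparingInt(Txn::amount)));
    }

    public static void main(String[] args) {
        List<Txn> txns = Arrays.asList(
            new Txn("ann", 5), new Txn("ann", 9), new Txn("bob", 3));
        largestBySeller(txns).forEach(
            (seller, txn) -> System.out.println(seller + " -> " + txn.get().amount()));
        System.out.println(largestOfNone().isPresent());
    }
}
```

In the grouping case the Optional wrapper is pure overhead for the
caller, which is the argument against it; in the standalone case it is
the only honest answer, which is the argument for it.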
> On 6/12/2013 1:15 PM, Brian Goetz wrote:
>> I've done a pass on the standard Collectors to adapt them to the
>> post-transform.  Significant changes:
>>
>>   - All factory methods that returned Collector<T,R> now return
>> Collector<T,?,R>.  (It is good that no factory method leaks its internal
>> type.)  We can continue to discuss mitigation plans on this, if
>> necessary, in a separate thread.
>>
>>   - The accumulator function in Collector is now back to a BiConsumer
>> rather than a BiFunction.  This simplified a number of implementations.
>> The STRICTLY_MUTATIVE characteristic goes away entirely.
>>
>>   - toList is now back to strict ArrayList, as Remi requested.
>>
>>   - toStringBuilder can now hide its StringBuilder, and collect to a
>> String instead.  So I renamed it "concatenating" (and also extended it
>> to collect CharSequence instead of String.)
>>
>>   - toStringJoiner can similarly hide the internal StringJoiner, so was
>> renamed to "joining(delimiter)".  (Confusion with database joins is
>> possible, open to a better name.)  Also on the to-do list: Paul
>> suggested a way to support the full form of StringJoiner (with prefix
>> and postfix) so I'll add an overload for that.
>>
>>   - The various reducing collectors can now use a mutable internal box
>> class, and hide that as an implementation detail, eliminating the
>> internal boxing in sumBy().
>>
>>   - It would be nice to overload sumBy(mapper) with int, long, and
>> double versions, but unfortunately we have crossed the boundary of what
>> type inference can disambiguate.  We have some choices here:
>>     - Have a single sumBy(ToLongFunction<T>)
>>     - Rename to summingXxx, allowing summingInt(ToIntFunction),
>> summingLong(ToLongFunction), ...
>>
>>   - I want to add averaging() collectors (and now can), which would have
>> to follow whatever naming choice we select above.
>>
>>   - Related, we have separately named toXxxSummaryStatistics which
>> follow the same pattern.  If we go with summingInt/averagingInt, maybe
>> this becomes summarizingInt?  We also have the opportunity now to make
>> the resulting statistics immutable on completion -- do we want to do
>> that?
>>
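For a feel of how the "summingXxx" naming option and the joining
collector read at the call site, here is a small sketch.  It uses the
names that eventually shipped in java.util.stream.Collectors
(summingInt, averagingInt, summarizingInt, and joining with a prefix
and suffix); treat that as an assumption relative to this thread, where
the naming was still open.  NamingDemo and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class NamingDemo {
    // summingInt(ToIntFunction): the primitive type is part of the name,
    // so no overload resolution across int/long/double is needed.
    static int totalLength(List<String> words) {
        return words.stream().collect(Collectors.summingInt(String::length));
    }

    // averagingInt follows the same naming pattern.
    static double averageLength(List<String> words) {
        return words.stream().collect(Collectors.averagingInt(String::length));
    }

    // summarizingInt follows the pattern too and yields a statistics object.
    static IntSummaryStatistics lengthStats(List<String> words) {
        return words.stream().collect(Collectors.summarizingInt(String::length));
    }

    // joining(delimiter, prefix, suffix): the full StringJoiner form,
    // with the StringJoiner itself hidden from the caller.
    static String joined(List<String> words) {
        return words.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc");
        System.out.println(totalLength(words));
        System.out.println(averageLength(words));
        System.out.println(lengthStats(words));
        System.out.println(joined(words));
    }
}
```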
>> To put it all in one place, here are the advantages of this additional
>> feature:
>>
>>   - It is the first thing that nearly every user asks for when they see
>> Collector; its lack is a significant gap.  We had wanted this from the
>> beginning; earlier versions of Collector made it impossible, but
>> later evolutions made it possible again.
>>   - It makes possible Collectors like averaging(), which people want and
>> which were previously not practical.
>>   - It enables Collectors to enforce invariants in the final result that
>> cannot be enforced in the intermediate accumulation, such as tree
>> balancing, immutability, etc.
>>   - It enables Collectors like "toStringBuilder" to not leak their
>> internal state (StringBuilder) into the user code, but instead provide
>> the result type that the user actually wants (String).
>>   - It eliminates the complexity of STRICTLY_MUTATIVE.
>>   - It eliminates the performance overhead of boxing during reduction.
>>
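Several of the points above (a mutable internal box, no boxing during
reduction, a result type different from the accumulation type) can be
seen together in one hand-written collector.  This is only a sketch
under the assumption of a four-function Collector.of factory, as in the
released java.util.stream.Collector; AveragingSketch and its long[2]
box layout are hypothetical, not the actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.stream.Collector;

class AveragingSketch {
    // Hypothetical averaging collector. The long[2] box holds
    // {sum, count} and never escapes to the caller.
    static <T> Collector<T, ?, Double> averaging(ToIntFunction<? super T> mapper) {
        return Collector.<T, long[], Double>of(
            () -> new long[2],                                    // fresh box
            (box, t) -> { box[0] += mapper.applyAsInt(t); box[1]++; },
            (a, b) -> { a[0] += b[0]; a[1] += b[1]; return a; },  // merge partial boxes
            box -> box[1] == 0 ? 0.0 : (double) box[0] / box[1]); // post-transform
    }

    static double averageLength(List<String> words) {
        return words.stream().collect(averaging(String::length));
    }

    public static void main(String[] args) {
        System.out.println(averageLength(Arrays.asList("a", "bb", "ccc")));
    }
}
```

Returning 0.0 for an empty input is a choice made for this sketch; it is
one possible answer to the "value associated with no results" question,
not a settled one.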
>> In totality, I see these benefits as a huge step forward.  I realize
>> there are some rough edges and we can continue to discuss how to file
>> them down, or whether we wish to live with them.
>>
>> I'll be checking these into lambda shortly and posting a link to the
>> docs for more detailed review.
>>
>> On 5/28/2013 6:23 PM, Brian Goetz wrote:
>>> Adding the ability to have a post-transform function raises some
>>> questions about how the standard collectors should change to
>>> accommodate it.  These fall into two categories:
>>>  - Should we?
>>>  - How?
>>>
>>> For collectors like toStringBuilder, we can now collect to a String
>>> and not expose the intermediate StringBuilder type.  This is both
>>> closer to what the user wants and allows for better implementation
>>> hiding:
>>>
>>> static Collector<String, ?, String> toStringBuilder() { ... }
>>>
>>> Of course, now the name is wrong.  So it would need a new name.
>>> (Ditto for toStringJoiner.)
>>>
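A minimal sketch of such a collector, assuming a Collector.of factory
that takes a finishing function (as the released API does).  The name
"concatenating" is taken from later in this thread; ConcatSketch and
concat are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;

class ConcatSketch {
    // Accumulates into a StringBuilder but finishes with toString(),
    // so callers only ever see a String.
    static Collector<CharSequence, ?, String> concatenating() {
        return Collector.<CharSequence, StringBuilder, String>of(
            StringBuilder::new,
            (sb, cs) -> sb.append(cs),          // accumulate one element
            (left, right) -> left.append(right), // merge partial results
            StringBuilder::toString);            // post-transform
    }

    static String concat(List<String> parts) {
        return parts.stream().collect(concatenating());
    }

    public static void main(String[] args) {
        System.out.println(concat(Arrays.asList("a", "b", "c")));
    }
}
```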
>>> It also makes sense to have a new combinator that can attach a
>>> post-transform to an existing Collector (name is just a
>>> placeholder):
>>>
>>> <T, I, R> Collector<T, I, R> transforming(Function<I, R>,
>>> Collector<T, ?, I>)
>>>
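One way the combinator could be written is sketched below.  It assumes
a Collector that exposes supplier(), accumulator(), combiner(), and
finisher(), plus a Collector.of factory, as in the released API.  The
name "transforming" is the placeholder from above; the extra type
parameter A for the hidden accumulation type is my assumption.  The
released Collectors class offers collectingAndThen for the same
purpose.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class TransformingSketch {
    // Hypothetical combinator: reuse the downstream's accumulation
    // machinery and append f to its finisher. Characteristics are
    // dropped, which is conservative (nothing is promised that the
    // downstream did not promise).
    static <T, A, I, R> Collector<T, A, R> transforming(
            Function<I, R> f, Collector<T, A, I> downstream) {
        return Collector.<T, A, R>of(
            downstream.supplier(),
            downstream.accumulator(),
            downstream.combiner(),
            downstream.finisher().andThen(f));
    }

    static int countDistinct(List<String> words) {
        return words.stream().collect(transforming(Set::size, Collectors.toSet()));
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(Arrays.asList("a", "b", "a")));
    }
}
```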
>>> A harder question is how much to introduce immutability.  For
>>> example, one negative of the current toList() collector is that the
>>> returned list is sometimes, but not always, immutable.  It would be
>>> nice to be able to commit to something.  We could easily make it
>>> immutable with a post-transform of Collections::unmodifiableList.  At
>>> first, this seems a no-brainer.  But after more thought, it's
>>> definitely a "should we?"
>>>
>>> Consider how this plays as a downstream collector.  The simplest form
>>> of groupingBy -- groupingBy(f) -- expands to groupingBy(f, toList()).
>>> If we made toList always return an immutable List, then we would have
>>> to apply the post-transform to every value of the resulting map,
>>> likely via a (sequential) Map.replaceAll on the simplest groupingBy
>>> operation, even when the user didn't care about immutability.  Making
>>> every groupingBy user pay for this seems like a lot.  (Alternately,
>>> the default toList() could still return an immutable list, but the
>>> default groupingBy could use a different downstream collector.)
>>>
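The per-value cost described above looks roughly like this when done by
hand.  The sketch assumes the released groupingBy(f), which collects
each group into a mutable List; ReplaceAllDemo and groupThenFreeze are
hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ReplaceAllDemo {
    static Map<Integer, List<String>> groupThenFreeze(List<String> words) {
        // groupingBy(f) behaves like groupingBy(f, toList()).
        Map<Integer, List<String>> byLength =
            words.stream().collect(Collectors.groupingBy(String::length));
        // The post-transform has to visit every value in the map,
        // sequentially, after the collection itself has finished.
        byLength.replaceAll((len, list) -> Collections.unmodifiableList(list));
        return byLength;
    }

    public static void main(String[] args) {
        System.out.println(groupThenFreeze(Arrays.asList("a", "bb", "cc")));
    }
}
```

The extra pass is proportional to the number of keys, and every caller
of the simplest groupingBy would pay it whether or not they wanted
immutable lists.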
>>> One option is to have mutable and immutable versions of every
>>> Collection/Map-bearing Collector.  But this is a 2x explosion of
>>> Collectors, after we did so much work to pare back the size of the
>>> Collector set.   Another is to have combinators for adding
>>> immutability to Collection, List, Set, and Map.   Then an immutable
>>> groupingBy would be:
>>>
>>> collect(asImmutableMap(groupingBy(f, asImmutableList(toList()))));
>>>
>>> Wordy, but not terrible, and probably better than imposing the costs
>>> on everyone?
>>>
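The combinator approach could be sketched as follows, assuming
collectingAndThen and the Collections.unmodifiableXxx wrappers from the
released JDK.  asImmutableList and asImmutableMap are the hypothetical
names used above; they produce unmodifiable views rather than truly
immutable copies.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class ImmutableWrappers {
    // Hypothetical combinator: wrap the collected List on completion.
    static <T, A, E> Collector<T, A, List<E>> asImmutableList(
            Collector<T, A, List<E>> downstream) {
        return Collectors.collectingAndThen(downstream, Collections::unmodifiableList);
    }

    // Hypothetical combinator: wrap the collected Map on completion.
    static <T, A, K, V> Collector<T, A, Map<K, V>> asImmutableMap(
            Collector<T, A, Map<K, V>> downstream) {
        return Collectors.collectingAndThen(downstream, Collections::unmodifiableMap);
    }

    static Map<Integer, List<String>> groupImmutable(List<String> words) {
        return words.stream().collect(
            asImmutableMap(Collectors.groupingBy(String::length,
                asImmutableList(Collectors.toList()))));
    }

    public static void main(String[] args) {
        System.out.println(groupImmutable(Arrays.asList("a", "bb", "cc")));
    }
}
```

Only callers who ask for the wrappers pay for them; a plain
groupingBy(f) stays as cheap as before.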