Post-transform and the standard Collectors
Brian Goetz
brian.goetz at oracle.com
Wed Jun 12 13:59:25 PDT 2013
I've posted a doc snapshot here:
http://cr.openjdk.java.net/~briangoetz/doctmp/doc/
As to the ? issue: looking at declarations like:
static <T,K,D,A,M extends java.util.Map<K,D>>
Collector<T,?,M> groupingBy(...)
there's enough generics noise there that the additional question mark
seems not the worst problem...
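
To illustrate the point about the wildcard, here is a minimal sketch of what the `?` looks like from the caller's side. It assumes the Collectors API as it later shipped in JDK 8 (which may differ in detail from the snapshot linked above); the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; only the java.util.stream calls are real API.
class WildcardCollectorSketch {
    // The accumulation type is a wildcard: callers cannot name it
    // and do not need to.
    static Collector<String, ?, TreeMap<Integer, List<String>>> byLength() {
        return Collectors.groupingBy(String::length, TreeMap::new, Collectors.toList());
    }

    // Using the collector never mentions the hidden accumulation type.
    static Map<Integer, List<String>> group(Stream<String> words) {
        return words.collect(byLength());
    }

    public static void main(String[] args) {
        // prints {1=[a], 2=[bb, cc], 3=[ddd]}
        System.out.println(group(Stream.of("a", "bb", "cc", "ddd")));
    }
}
```

The `?` only appears where a Collector is stored or returned; a plain `stream.collect(...)` call site is unaffected.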
On 6/12/2013 3:13 PM, Brian Goetz wrote:
> A question this raises: it is now possible (wasn't before) for
> Collectors like minBy to return Optional, like their stream
> counterparts. However, it is far less likely that such a Collector will
> be invoked on an empty stream than Stream.minBy() will. Here's why:
>
> If all you're doing is getting the minima of a stream, you're more
> likely to do
>
> stream.minBy(c)
>
> than
>
> stream.collect(Collectors.minBy(c))
>
> The more common case where Collectors.minBy will be used is in the
> downstream of a groupingBy:
>
> Map<Person, Txn> largestTxnBySeller =
>     txns.collect(groupingBy(Txn::seller, maxBy(comparing(Txn::amount))));
>
> Here, we won't create a map key until there is at least one value for
> it, so the downstream collector never sees an empty input.
>
> So there are arguments both for and against having these collectors
> collect to Optional. (If we don't, we should document the value
> associated with no results, which is almost certainly null for minBy,
> maxBy, and reducing(op)).
>
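
A minimal sketch of the trade-off described in the message above, assuming the JDK 8 API as it eventually shipped, where Collectors.maxBy collects to Optional. The Txn class and its fields are hypothetical stand-ins, and the seller is a String here rather than a Person.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class MaxByDownstreamSketch {
    // Hypothetical stand-in for the Txn type used in the message.
    static final class Txn {
        final String seller;
        final int amount;
        Txn(String seller, int amount) { this.seller = seller; this.amount = amount; }
        String seller() { return seller; }
        int amount() { return amount; }
    }

    static final Comparator<Txn> BY_AMOUNT = Comparator.comparingInt(Txn::amount);

    // As a downstream collector: a key exists only if at least one Txn
    // mapped to it, so every Optional in the map is present.
    static Map<String, Optional<Txn>> largestTxnBySeller(List<Txn> txns) {
        return txns.stream().collect(
            Collectors.groupingBy(Txn::seller, Collectors.maxBy(BY_AMOUNT)));
    }

    // As a top-level collector: an empty stream yields an empty Optional.
    static Optional<Txn> largest(List<Txn> txns) {
        return txns.stream().collect(Collectors.maxBy(BY_AMOUNT));
    }

    public static void main(String[] args) {
        List<Txn> txns = Arrays.asList(
            new Txn("alice", 10), new Txn("alice", 30), new Txn("bob", 5));
        largestTxnBySeller(txns).forEach(
            (seller, txn) -> System.out.println(seller + " -> " + txn.get().amount()));
        System.out.println(largest(Collections.<Txn>emptyList()).isPresent());
    }
}
```

In the grouping case the Optional wrapper is pure overhead for the caller, which is the argument against it; in the top-level case it is the only honest answer for an empty stream.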
> On 6/12/2013 1:15 PM, Brian Goetz wrote:
>> I've done a pass on the standard Collectors to adapt them to the
>> post-transform. Significant changes:
>>
>> - All factory methods that returned Collector<T,R> now return
>> Collector<T,?,R>. (It is good that no factory method leaks its internal
>> type.) We can continue to discuss mitigation plans on this, if
>> necessary, in a separate thread.
>>
>> - The accumulator function in collector is now back to a BiConsumer
>> rather than a BiFunction. This simplified a number of implementations.
>> The STRICTLY_MUTATIVE characteristic goes away entirely.
>>
>> - toList is now back to strict ArrayList, as Remi requested.
>>
>> - toStringBuilder can now hide its StringBuilder, and collect to a
>> String instead. So I renamed it "concatenating" (and also extended it
>> to collect CharSequence instead of String.)
>>
>> - toStringJoiner can similarly hide the internal StringJoiner, so was
>> renamed to "joining(delimiter)". (Confusion with database joins is
>> possible, open to a better name.) Also on the to-do list: Paul
>> suggested a way to support the full form of StringJoiner (with prefix
>> and postfix) so I'll add an overload for that.
>>
>> - The various reducing collectors can now use a mutable internal box
>> class, and hide that as an implementation detail, eliminating the
>> internal boxing in sumBy().
>>
>> - It would be nice to overload sumBy(mapper) with int, long, and
>> double versions, but unfortunately we have crossed the boundary of what
>> type inference can disambiguate. We have some choices here:
>>     - Have a single sumBy(ToLongFunction<T>)
>>     - Rename to summingXxx, allowing summingInt(ToIntFunction),
>>       summingLong(ToLongFunction), ...
>>
>> - I want to add averaging() collectors (and now can), which would have
>> to follow whatever naming choice we select above.
>>
>> - Related, we have separately named toXxxSummaryStatistics which
>> follow the same pattern. If we go with summingInt/averagingInt, maybe
>> this becomes summarizingInt? We also have the opportunity now to make
>> the resulting statistics immutable on completion -- do we want to do
>> that?
>>
>> To put it all in one place, here are the advantages of this additional
>> feature:
>>
>> - It is the first thing that nearly every user asks for when they see
>> Collector; its lack is a significant gap. We had wanted this from the
>> beginning; earlier versions of Collector made it impossible, but later
>> evolutions made it possible again.
>> - It makes possible Collectors like averaging(), which people want and
>> which were previously not practical.
>> - It enables Collectors to enforce invariants in the final result that
>> cannot be enforced in the intermediate accumulation, such as tree
>> balancing, immutability, etc.
>> - It enables Collectors like "toStringBuilder" to not leak their
>> internal state (StringBuilder) into the user code, but instead provide
>> the result type that the user actually wants (String).
>> - It eliminates the complexity of STRICTLY_MUTATIVE.
>> - It eliminates the performance overhead of boxing during reduction.
>>
>> In totality, I see these benefits as a huge step forward. I realize
>> there are some rough edges and we can continue to discuss how to file
>> them down, or whether we wish to live with them.
>>
>> I'll be checking these into lambda shortly and posting a link to the
>> docs for more detailed review.
>>
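
The naming options discussed in the message above can be sketched with the JDK 8 API as it eventually shipped, which went with the summingInt / averagingInt / summarizingInt family and a joining overload with prefix and suffix. This is only an illustration under that assumption; the demo class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class SummingSketch {
    // Primitive-specialized name: no overload ambiguity for type inference.
    static int totalLength(List<String> words) {
        return words.stream().collect(Collectors.summingInt(String::length));
    }

    // Averaging needs a finishing step (sum / count), hence the post-transform.
    static double averageLength(List<String> words) {
        return words.stream().collect(Collectors.averagingInt(String::length));
    }

    static IntSummaryStatistics lengthStats(List<String> words) {
        return words.stream().collect(Collectors.summarizingInt(String::length));
    }

    // The internal StringJoiner is hidden; the caller only sees a String.
    static String bracketed(List<String> words) {
        return words.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc");
        System.out.println(totalLength(words));
        System.out.println(averageLength(words));
        System.out.println(lengthStats(words).getMax());
        System.out.println(bracketed(words)); // prints [a, bb, ccc]
    }
}
```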
>> On 5/28/2013 6:23 PM, Brian Goetz wrote:
>>> Adding the ability to have a post-transform function raises some
>>> questions about how the standard collectors should change to
>>> accommodate it. These fall into two categories:
>>>
>>> - Should we?
>>> - How?
>>>
>>> For collectors like toStringBuilder, we can now collect to a String
>>> and not expose the intermediate StringBuilder type. This is both
>>> closer to what the user wants and allows for better implementation
>>> hiding:
>>>
>>> static Collector<String, ?, String> toStringBuilder() { ... }
>>>
>>> Of course, now the name is wrong. So it would need a new name.
>>> (Ditto for toStringJoiner.)
>>>
>>> It also makes sense to have a new combinator that can attach a
>>> post-transform to an existing Collector (name is just a
>>> placeholder):
>>>
>>> <T, I, R> Collector<T, I, R> transforming(Function<I, R>,
>>> Collector<T, ?, I>)
>>>
>>> A harder question is how much to introduce immutability. For
>>> example, one negative of the current toList() collector is that the
>>> returned list is sometimes, but not always, immutable. It would be
>>> nice to be able to commit to something. We could easily make it
>>> immutable with a post-transform of Collections::unmodifiableList. At
>>> first, this seems a no-brainer. But after more thought, it's
>>> definitely a "should we?"
>>>
>>> Consider how this plays as a downstream collector. The simplest form
>>> of groupingBy -- groupingBy(f) -- expands to groupingBy(f, toList()).
>>> If we made toList always return an immutable List, then we would have
>>> to apply the post-transform to every value of the resulting map,
>>> likely via a (sequential) Map.replaceAll on the simplest groupingBy
>>> operation, even when the user didn't care about immutability. Making
>>> every groupingBy user pay for this seems like a lot. (Alternately,
>>> the default toList() could still return an immutable list, but the
>>> default groupingBy could use a different downstream collector.)
>>>
>>> One option is to have mutable and immutable versions of every
>>> Collection/Map-bearing Collector. But this is a 2x explosion of
>>> Collectors, after we did so much work to pare back the size of the
>>> Collector set. Another is to have combinators for adding
>>> immutability to Collection, List, Set, and Map. Then an immutable
>>> groupingBy would be:
>>>
>>> collect(asImmutableMap(groupingBy(f, asImmutableList(toList()))));
>>>
>>> Wordy, but not terrible, and probably better than imposing the costs
>>> on everyone?
>>>
>>>
>>>
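
The combinator approach from the message above can be sketched as follows. This assumes the JDK 8 API as it eventually shipped, where the "transforming" placeholder became Collectors.collectingAndThen; asImmutableList and asImmutableMap are hypothetical helpers named after the placeholders in the message, and they produce unmodifiable views rather than deep-immutable structures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class ImmutableCombinatorSketch {
    // Hypothetical helper: attaches an unmodifiable-wrapping post-transform.
    static <T> Collector<T, ?, List<T>> asImmutableList(Collector<T, ?, List<T>> downstream) {
        return Collectors.collectingAndThen(downstream, Collections::unmodifiableList);
    }

    // Hypothetical helper: same idea for Map-bearing collectors.
    static <T, K, V> Collector<T, ?, Map<K, V>> asImmutableMap(Collector<T, ?, Map<K, V>> downstream) {
        return Collectors.collectingAndThen(downstream, Collections::unmodifiableMap);
    }

    // Only callers who ask for immutability pay for the extra finishing pass.
    static Map<Integer, List<String>> immutableByLength(List<String> words) {
        Collector<String, ?, List<String>> lists = asImmutableList(Collectors.toList());
        Collector<String, ?, Map<Integer, List<String>>> groups =
            asImmutableMap(Collectors.groupingBy(String::length, lists));
        return words.stream().collect(groups);
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> m = immutableByLength(Arrays.asList("a", "bb", "cc"));
        System.out.println(m.get(2)); // prints [bb, cc]
        try {
            m.get(2).add("dd");
        } catch (UnsupportedOperationException e) {
            System.out.println("values are unmodifiable");
        }
    }
}
```

A plain groupingBy(f) keeps its mutable ArrayList values and skips the finishing pass entirely, which matches the "don't impose the cost on everyone" position.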