Proposal to enhance Stream.collect

Sun Feb 24 20:28:11 UTC 2019

Thanks for the feedback.

I really like Brian's suggestion for a default
Collector.supplier(sizeEstimate), since exposing the full stream
characteristics is overkill.

I don't think the use of this improvement is limited to pre-sizing
ArrayLists, though. Some other Collectors that come to mind are
StringJoiner (Collectors.joining) and my FutureJoiner [1], which collects a
stream of CompletableFuture<T> into a CompletableFuture<List<T>>.

And even if Collectors.toList was the only collector to benefit from
pre-sizing, I think it is still worth the effort, as this is perhaps the
most used Collector.

[1]: https://gist.github.com/AugustNagro/e92c48e39f891f16500f36c33839383a

On Sun, Feb 24, 2019 at 8:51 AM Brian Goetz <brian.goetz at oracle.com> wrote:

> We did consider this problem when designing the Collector API; for
> example, it would have been nice if we could have a `toArray()`
> collector that had all the optimizations of `Stream::toArray`.
>
> When we looked into it, we found a number of concerning details that
> caused us to turn back (many of which you've already identified), such
> as the difficulty of managing parallelism, the intrusion into the API,
> etc.  What we found is that all this additional complexity was basically
> in aid of only a few use cases -- such as collecting into a pre-sized
> ArrayList.  Where are the next hundred use cases for such a mechanism,
> that would justify this incremental API complexity?  We didn't see them,
> but maybe there are some.
>
> A less intrusive API direction might be a version of Collector whose
> supplier function took a size-estimate argument; this might even help in
> parallel since it allows for intermediate results to start with a better
> initial size guess.  (And this could be implemented as a default that
> delegates to the existing supplier.)  Still, not really sure this
> carries its weight.
>
> >   The below code returns false, for example (is this
> > a bug?):
> >
> > Stream.of(1,2,3).parallel().map(i ->
> > i+1).spliterator().hasCharacteristics(Spliterator.CONCURRENT)
>
> Not a bug.   The `Stream::spliterator` method (along with `iterator`) is
> provided as an "escape hatch" for operations that need to get at the
> elements but which cannot be easily expressed using the Stream API.
> This method makes a good-faith attempt to propagate a reasonable set of
> characteristics (for a stream with no intermediate ops, it does delegate
> to the underlying source for its spliterator), but, given that `Stream`
> is not in fact a data structure, when there is nontrivial computation on
> the actual source, a relatively bare-bones spliterator is provided.
>
> While this could probably be improved in specific cases, the return on
> effort (and risk) is likely to be low, because `Stream::spliterator` is
> already an infrequently used method, and it would only matter in a small
> fraction of those cases.  So you're in "corner case of a corner case"
> territory.
>
>
> On 2/23/2019 5:27 PM, August Nagro wrote:
> > Calling Stream.collect(Collector) is a popular terminal stream operation.
> > But because the collect methods provide no detail of the stream's
> > characteristics, collectors are not as efficient as they could be.
> >
> > For example, consider a non-parallel, sized stream that is to be
> collected
> > as a List. This is a very common case for streams with a Collection
> source.
> > Because of the stream characteristics, the Collector.supplier() could
> > initialize a list with initial size (since the merging function will
> never
> > be called), but the current implementation prevents this.
> >
> > I should note that the characteristics important to collectors are those
> > defined by Spliterator, like: Spliterator::characteristics,
> > Spliterator::estimateSize, and Spliterator::getExactSizeIfKnown.
> >
> > One way this enhancement could be implemented is by adding a method
> > Stream.collect(Function<ReadOnlySpliterator, Collector>
> collectorBuilder).
> > ReadOnlySpliterator would implement the spliterator methods mentioned
> > above, and Spliterator would be made to implement this interface.
> >
> > For example, here is a gist with what Collectors.toList could look like:
> > https://gist.github.com/AugustNagro/e66a0ddf7d47b4f11fec8760281bb538
> >
> > ReadOnlySpliterator may need to be replaced with some stream specific
> > abstraction, however, since Stream.spliterator() does not return with the
> > correct characteristics. The below code returns false, for example (is
> this
> > a bug?):
> >
> > Stream.of(1,2,3).parallel().map(i ->
> > i+1).spliterator().hasCharacteristics(Spliterator.CONCURRENT)
> >
> > Looking forward to your thoughts,
> >
> > - August Nagro
>
>