Proposal to enhance Stream.collect

Brian Goetz brian.goetz at oracle.com
Sun Feb 24 14:51:10 UTC 2019


We did consider this problem when designing the Collector API; for 
example, it would have been nice if we could have a `toArray()` 
collector that had all the optimizations of `Stream::toArray`.
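
To make concrete what is missing, here is roughly what a toArray() 
collector has to look like against today's API (an illustrative sketch, 
not JDK code): it buffers and copies, because the supplier receives no 
size information and intermediate results are merged pairwise, whereas 
Stream::toArray can pre-size a single array and have parallel leaves 
write directly into their own slices.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collector;

    class ToArrayCollector {
        // Illustrative only: no size hint reaches the supplier, so we buffer
        // into ArrayLists, merge by copying, and copy once more at the end.
        static <T> Collector<T, List<T>, Object[]> toArray() {
            return Collector.of(
                    ArrayList::new,                          // cannot pre-size
                    List::add,                               // one element at a time
                    (left, right) -> { left.addAll(right); return left; },
                    List::toArray);                          // final copy into an array
        }
    }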

When we looked into it, we found a number of concerning details that 
caused us to turn back (many of which you've already identified), such 
as the difficulty of managing parallelism, the intrusion into the API, 
etc.  What we found is that all this additional complexity was basically 
in aid of only a few use cases -- such as collecting into a pre-sized 
ArrayList.  Where are the next hundred use cases for such a mechanism 
that would justify this incremental API complexity?  We didn't see them, 
but maybe there are some.

A less intrusive API direction might be a version of Collector whose 
supplier function took a size-estimate argument; this might even help in 
parallel since it allows for intermediate results to start with a better 
initial size guess.  (And this could be implemented as a default that 
delegates to the existing supplier.)  Still, not really sure this 
carries its weight.
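
For concreteness, the shape might look something like this (the names 
here are invented for illustration, not a proposed API):

    import java.util.stream.Collector;

    // Hypothetical: a Collector whose container creation can accept a size
    // estimate, with a default that ignores the estimate and delegates to
    // the existing zero-argument supplier, so current collectors still work.
    interface SizedCollector<T, A, R> extends Collector<T, A, R> {
        default A newContainer(long sizeEstimate) {
            return supplier().get();   // default: fall back to the plain supplier
        }
    }

A size-aware toList() could then override newContainer to build a 
new ArrayList<>((int) sizeEstimate) when the estimate is known and sane, 
while every existing collector keeps its current behavior.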

>   The below code returns false, for example (is this
> a bug?):
>
> Stream.of(1,2,3).parallel().map(i ->
> i+1).spliterator().hasCharacteristics(Spliterator.CONCURRENT)

Not a bug.  The `Stream::spliterator` method (along with `iterator`) is 
provided as an "escape hatch" for operations that need to get at the 
elements but which cannot be easily expressed using the Stream API.  
This method makes a good-faith attempt to propagate a reasonable set of 
characteristics (for a stream with no intermediate ops, it does delegate 
to the underlying source for its spliterator), but, given that `Stream` 
is not in fact a data structure, when there is nontrivial computation on 
the actual source, a relatively bare-bones spliterator is provided.
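
For example (illustrative only; the exact bits reported in the wrapped 
case are an implementation detail, so the snippet prints them rather 
than asserting them):

    import java.util.List;
    import java.util.Spliterator;

    class SpliteratorCharacteristics {
        public static void main(String[] args) {
            List<Integer> source = List.of(1, 2, 3);

            // No intermediate ops: spliterator() hands back the source's own
            // spliterator, so characteristics such as SIZED come straight from it.
            Spliterator<Integer> plain = source.stream().spliterator();
            System.out.println(plain.hasCharacteristics(Spliterator.SIZED));  // true

            // With a computation stage, a wrapping spliterator is returned; it
            // makes a good-faith estimate of characteristics from the pipeline
            // flags rather than passing through whatever the source advertised.
            Spliterator<Integer> mapped = source.stream().map(i -> i + 1).spliterator();
            System.out.println(mapped.characteristics());
        }
    }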

While this could probably be improved in specific cases, the return on 
effort (and risk) is likely to be low, because `Stream::spliterator` is 
already an infrequently used method, and it would only matter in a small 
fraction of those cases.  So you're in "corner case of a corner case" 
territory.


On 2/23/2019 5:27 PM, August Nagro wrote:
> Calling Stream.collect(Collector) is a popular terminal stream operation.
> But because the collect methods provide no detail of the stream's
> characteristics, collectors are not as efficient as they could be.
>
> For example, consider a non-parallel, sized stream that is to be collected
> as a List. This is a very common case for streams with a Collection source.
> Because of the stream characteristics, the Collector.supplier() could
> initialize a list with initial size (since the merging function will never
> be called), but the current implementation prevents this.
>
> I should note that the characteristics important to collectors are those
> defined by Spliterator, like: Spliterator::characteristics,
> Spliterator::estimateSize, and Spliterator::getExactSizeIfKnown.
>
> One way this enhancement could be implemented is by adding a method
> Stream.collect(Function<ReadOnlySpliterator, Collector> collectorBuilder).
> ReadOnlySpliterator would implement the spliterator methods mentioned
> above, and Spliterator would be made to implement this interface.
>
> For example, here is a gist with what Collectors.toList could look like:
> https://gist.github.com/AugustNagro/e66a0ddf7d47b4f11fec8760281bb538
>
> ReadOnlySpliterator may need to be replaced with some stream specific
> abstraction, however, since Stream.spliterator() does not return with the
> correct characteristics. The below code returns false, for example (is this
> a bug?):
>
> Stream.of(1,2,3).parallel().map(i ->
> i+1).spliterator().hasCharacteristics(Spliterator.CONCURRENT)
>
> Looking forward to your thoughts,
>
> - August Nagro


