Into

Sat Dec 22 06:59:14 PST 2012

On Dec 21, 2012, at 10:47 PM, Brian Goetz <brian.goetz at Oracle.COM> wrote:
> If we get rid of into(), and replace it with an explicit reduction [1], then we may be able to get rid of sequential() too.
> 
> The primary use case for sequential() was for using non-thread-safe Collections as an into() target.  The convoluted into() turns the problem around on the target, calling target.addAll(stream) on it, and the typical implementation of into() is like the one in Collection:
> 
>    default void addAll(Stream<? extends E> stream) {
>        if (stream.isParallel())
>            stream = stream.sequential();
>        stream.forEach(this::add);
>    }
> 
> or more compactly
> 
>    default void addAll(Stream<? extends E> stream) {
>        stream.sequential().forEach(this::add);
>    }
> 
> since sequential() is now a no-op on sequential streams.
> 
> The code looks pretty, but the implementation is not; sequential() is a barrier, meaning you have to stop and collect all the elements into a temporary tree, and then dump them into the target.  But it is not obvious that it is a barrier, so people will be surprised.  (And on infinite streams, it's bad.)
> 

Yes, it catches people out.

> What used to be implicit in sequential() can now be made explicit with:
> 
>  if (stream.isParallel())
>     stream = stream...whateverWeCallOrderPreservingInto().stream()
> 
> That offers similar semantics and at least as good performance, while also being more transparent and requiring one fewer weird stream operation.
> 
> (I don't yet see the way to getting rid of .parallel(), but we can possibly move it out of Stream and into a static method Streams.parallel(stream), at some loss of discoverability.  But we can discuss.)
> 

I was recently thinking about alternative solutions to "sequential()" to reduce the barrier.

An alternative is to retain some form of sequential that is a partial barrier, retains left-to-right information and makes available data from leaf nodes in the correct order.

We could use a special reducing task that pushes leaf nodes onto a blocking queue (perhaps a priority queue if we can provide a sequence identifier/string for each leaf node). A special spliterator could peek/pop off that blocking queue and that spliterator can be used as input to the new sequential stream.

It would likely be more performant (no waiting until all leaf-nodes are processed) and less surprising to the developer (especially with inf streams) at the expense of some internal complexity for the reducer and spliterator pair.

It all depends what the most common cases are to go sequential. If it is mostly a crutch to stuff things into a non-concurrent data structure then we can use the "reducers/tabulators" as you indicate and get rid of sequential, which suggests a refactoring of parallel().

Anyone got any uses cases for a sequential that is a partial barrier?

Paul.