Encounter order

Wed Oct 24 14:12:19 PDT 2012

Right.  While we could choose between Kind A and Kind B, I think 
choosing kind B is kind of silly.  As you say, it already alters the 
semantics and would call for a different API -- results automatically 
become multisets.  Since Java doesn't even have a multiset type, this 
will be foreign -- and also (IMO) not very useful.

The general idea here is that parallelism is an optimization, which can 
affect side-effects, but not the result.  (This is the choice that 
Scala, Clojure, Haskell, Mathematica, Fortress, and others have taken.)

Since Java does so much with side-effects, sometimes it is hard to 
separate the result from the effects, but we should try anyway.  Here's 
an example:

   stream.map(x -> { System.out.println(x); return x; })
         .toArray();

Here, I would say that in either a serial or parallel stream, this 
should result in a copy of the stream source, but in parallel, the 
ordering (and more) of the side-effects (println) are unpredictable. 
(Operations like forEach are all side-effect.)

What this means is that the question of "what should parallel toArray 
do" isn't the right question; the question is "what should toArray do".

For example, one question we could ask is "what happens when you call 
reduce on a stream without an encounter order."  One possibility is 
"hope that the reducer is commutative"; another is to treat reduction on 
a order-less stream as an error.  Unfortunately we don't have ways in 
the type system to tag lambdas as commutative or associative, so the 
user is on their own to avoid GIGO here.  This isn't great, as it 
requires some nonmodular reasoning on the part of the user.  On the 
other hand, this isn't all that different from having to know "is this 
collection thread-safe or not."

On 10/24/2012 4:23 PM, Dan Smith wrote:
> On Oct 23, 2012, at 2:45 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
>> OK, let me try this a different way.
>>
>> Let's separate *result* from *side effects*.  Parallelism can change the timing of side-effects, but (I argue) should not change the result of a computation (e.g., summing integers in parallel should yield the same result as summing them sequentially (modulo overflow.))
>
> Sounds like this discussion is mixing two different kinds of parallel streams.
>
> Parallel stream Kind A is a Stream that produces identical results to a serial stream, but is allowed have out-of-order and out-of-thread side effects.
>
> Parallel stream Kind B is a different entity that makes no guarantees about iteration order.  This is not a Stream at all, and should be represented with a different interface; many operations in Stream should not exist in this interface, and those that remain will likely have different contracts (e.g., 'reduce' requires a commutative operator).
>
> (Aside 1: Sorry I don't have better names. :-))
> (Aside 2: I think the decision to merge serial and parallel streams under one interface was made with Kind A in mind.)
>
> The question is, I think, which kind of parallel stream we're really talking about when we say we want "parallel collections."  It's possible the answer is "both," and if that's the case, I think they should be implemented separately, rather than trying to mesh the two into one.  It's also possible that the performance gains enabled by Kind B don't justify its existence.
>
>>   [ 1, 2, 3 ].map( x -> x*2 )
>>
>> Is there anyone claiming the answer is NOT required to be
>>
>>   [ 2, 4, 6 ]
>
> You mean '[ 1, 2, 3 ].parallel().map( x -> x*2 )'.  For Kind B, this means '{ 1, 2, 3 }.map( x -> x*2 )' (using braces to suggest a set).
>
>> (Alternately, this claim amounts to saying that list.parallel().map() should return a multiset rather than a list.)
>
> Even more fundamentally, a Kind B stream should be modeled so that 'list.parallel()' is already a multiset, before 'map' is called.
>
> —Dan
>