Refactoring for DRY

Brian Goetz brian.goetz at oracle.com
Tue Apr 9 16:10:49 PDT 2013


That's a far more complicated question than it looks.  Let's peel 
this back in layers.

From 10,000 feet, a key design goal of the java.util.stream package is 
that (almost) all operations *can* be done safely either sequentially or 
in parallel, and that the only difference should be performance.  So we 
have no separate ParallelStream type whose operations might differ from 
Stream's.
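
For example, the same pipeline can be switched between the two modes 
without touching the operations themselves.  (A minimal sketch; Item and 
getValue are invented stand-ins, not names from the original code.)

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical element type standing in for whatever the real getter lives on.
    class Item {
        private final int value;
        Item(int value) { this.value = value; }
        int getValue() { return value; }
    }

    List<Item> items = Arrays.asList(new Item(1), new Item(2), new Item(3));

    // Exactly the same operations either way; only the execution strategy differs.
    int sequentialSum = items.stream().mapToInt(Item::getValue).sum();
    int parallelSum   = items.parallelStream().mapToInt(Item::getValue).sum();
    // sequentialSum == parallelSum == 6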

In order to accomplish that, the user has to follow some rules.  We've 
tried to design the API so that the path of least resistance leads to 
something valid under the rules.

The key rule is that the lambdas passed to the various stream operations 
not be dependent on any state that will change during the course of the 
pipeline execution.  No changing the stream source, no writing to state 
that other lambdas will read, no reading from state that some other 
thread is changing.  This is actually far easier to accomplish than it 
might seem at first, but it may require some re-learning.
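
Concretely (again a sketch, not code from the thread): the first version 
below breaks the rule by writing to shared state from inside the 
pipeline, and will misbehave when run in parallel; the second lets the 
library do the accumulating and is safe either way.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Breaks the rule: the lambda writes to state shared across the pipeline.
    // Run in parallel, this races on the unsynchronized ArrayList.
    List<Integer> results = new ArrayList<>();
    IntStream.range(0, 1_000).parallel().forEach(i -> results.add(i * 2));

    // Follows the rule: no shared mutable state; accumulation and merging are
    // the library's job.
    List<Integer> safeResults = IntStream.range(0, 1_000)
                                         .parallel()
                                         .mapToObj(i -> i * 2)
                                         .collect(Collectors.toList());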

Once you follow that rule, then a sequential and parallel execution of 
the same pipeline on the same source should return the same result 
(modulo some explicit non-determinism, such as ordering of side-effects 
with forEach and behavior of explicitly nondeterministic operations like 
findAny.)
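
To make that explicit nondeterminism concrete, here is a sketch of the 
operations I mean:

    import java.util.stream.IntStream;

    // With a parallel stream, the order of forEach's side effects is not
    // defined; forEachOrdered restores encounter order, at some cost to
    // parallelism.
    IntStream.range(0, 10).parallel().forEach(i -> System.out.print(i + " "));
    System.out.println();   // some interleaving of 0..9
    IntStream.range(0, 10).parallel().forEachOrdered(i -> System.out.print(i + " "));
    System.out.println();   // always 0 1 2 ... 9

    // findAny may return any element; findFirst is the deterministic choice.
    int any   = IntStream.range(0, 10).parallel().findAny().getAsInt();
    int first = IntStream.range(0, 10).parallel().findFirst().getAsInt();  // always 0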

Returning to your pipeline, the operations in the map-sum version 
qualify under this rule, so it is safe to do it in parallel.  So, should 
you?

Bear in mind that the *only* value of parallelism is that it gets you an 
answer faster; everything else about parallelism is a loss.  Parallel 
computations are less deterministic than sequential ones, and *always*
involve more total work than a sequential computation.  So the only 
reason to put up with parallelism is that it gives us the answer faster. 
  But it does not always give us the answer faster, and it's not even 
always easy to predict whether it will.

The key parameters in the cost model are N (size of the problem), Q (the 
cost of one iteration), and K (number of cores available.)  If K=1, 
obviously we're not going to win, so the best move is not to try.  If N 
is really small, we're unlikely to win (unless Q is huge) because the 
overhead of setting up and forking tasks will dwarf the cost of just 
doing it sequentially.
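
(For reference, the K in this model is just what the runtime reports as 
available cores; the parallel stream machinery sizes its default thread 
pool from it, give or take a thread.)

    // K in the cost model above: the number of cores the runtime can see.
    int k = Runtime.getRuntime().availableProcessors();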

Let's assume that N/Q/K are related such that eventually parallelism 
yields a win.  Typically, for low N, it is a loss, and then above some 
breakeven, it starts to win, with the win increasing as N increases. 
Depending on the problem, losing might be as modest as a few percent, or 
could be as bad as an order of magnitude or more.
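
The practical upshot is that the breakeven is something you can measure 
rather than guess at.  A crude sketch (the sizes are arbitrary, and for 
real numbers you'd want a proper harness such as JMH, with warm-up):

    import java.util.stream.LongStream;

    for (int n : new int[] { 1_000, 100_000, 10_000_000 }) {
        long t0 = System.nanoTime();
        long seq = LongStream.range(0, n).map(i -> i * 2).sum();
        long t1 = System.nanoTime();
        long par = LongStream.range(0, n).parallel().map(i -> i * 2).sum();
        long t2 = System.nanoTime();
        if (seq != par) throw new AssertionError();  // same answer either way
        System.out.printf("n=%,d  sequential=%,dus  parallel=%,dus%n",
                          n, (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }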

Returning to your pipeline, this is a low-Q problem; you're calling a 
getter and adding some numbers.  Breakevens for such problems in 
well-tuned systems might be a few hundred or a few thousand (not two and 
not millions.)  Only you know how big the data set is, and how many 
cores you have, so I can't give a one-size-fits-all answer.  But, I can 
tell you that for reasonable sized data sets, your calculation will 
parallelize quite well.  (BTW the library fuses the map+reduce 
operations together into a single parallel pass.)

A key variable in the cost model is the ratio of a compute step to a 
merge step.  The higher the merge cost, the higher the N breakeven.

Returning to your pipeline, your compute step and merge step are the 
same computation -- primitive addition.  This is good; the merge is 
almost trivial.  By contrast, in a case like groupBy to a Map, the merge 
step -- merging two maps by key -- is very expensive.  That's not so 
good; it makes us prefer more sequential computation and less parallelism.
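
As an illustration (invented data): a parallel groupingBy has each chunk 
build its own map and then merges those maps key by key, which is 
exactly the expensive merge step above.  Collectors.groupingByConcurrent 
trades away encounter order to accumulate into a single shared 
ConcurrentMap instead, which avoids most of that merge work.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    List<String> words = Arrays.asList("apple", "avocado", "banana", "blueberry");

    // Per-chunk maps are built and then merged key by key.
    Map<Character, List<String>> byLetter = words.parallelStream()
        .collect(Collectors.groupingBy(w -> w.charAt(0)));

    // One shared ConcurrentMap; no per-chunk maps to merge, but no
    // encounter-order guarantee for the value lists.
    Map<Character, List<String>> byLetterConcurrent = words.parallelStream()
        .collect(Collectors.groupingByConcurrent(w -> w.charAt(0)));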

Clear as mud?

On 4/9/2013 6:51 PM, Barry Burd wrote:
> Thank you, Brian. Is there any reason not to use parallelStream here
> (other than that the example is too small to warrant parallelism)? I
> assume that a function like sum() doesn't do its thing until all the
> various parallel values have arrived.
>  >>> On 4/9/2013 at 5:01 PM, in message <516481A2.8000702 at oracle.com>,
> Brian Goetz <brian.goetz at oracle.com> wrote:
> The recommendation is to format chains like:
> ... Etc.