Refactoring for DRY
Brian Goetz
brian.goetz at oracle.com
Tue Apr 9 16:10:49 PDT 2013
That's a far more complicated question than it looks. Let's peel it
apart in layers.
From 10,000 feet, a key design goal of the java.util.stream package is
that (almost) all operations *can* be done safely either sequentially or
in parallel, and the only difference would be performance. So, we have
no separate ParallelStream type whose operations might differ from
those of Stream.
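To make that concrete, here's a small sketch (the names are made up,
not from the original thread): sequential and parallel streams have the
same static type, so the same operations apply to both.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Stream;

    public class SameTypeSketch {
        public static void main(String[] args) {
            List<String> words = Arrays.asList("a", "b", "c");

            // Both are just Stream<String>; parallelism is a property of the
            // pipeline, not a separate type with different operations.
            Stream<String> seq = words.stream();
            Stream<String> par = words.parallelStream();

            System.out.println(seq.count() + " " + par.count());
        }
    }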
In order to accomplish that, the user has to follow some rules. We've
tried to design the API so that the path of least resistance leads to
something valid under the rules.
The key rule is that the lambdas passed to the various stream operations
must not depend on any state that will change during the course of the
pipeline execution. No changing the stream source, no writing to state
that other lambdas will read, no reading from state that some other
thread is changing. This is actually far easier to accomplish than it
might seem at first, but may require some re-learning.
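Here's a minimal sketch of what that rule rules out, next to the safe
alternative (again, hypothetical names, not from the original messages):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class NonInterferenceSketch {
        public static void main(String[] args) {
            List<String> names = Arrays.asList("alpha", "beta", "gamma");

            // Unsafe: the lambda writes to state that changes while the
            // pipeline runs. In parallel this races; even run sequentially
            // it violates the rule.
            List<String> unsafe = new ArrayList<>();
            names.stream().map(String::toUpperCase).forEach(unsafe::add);

            // Safe: each lambda reads only its argument; accumulation is
            // left to the library, which knows how to do it in parallel.
            List<String> safe = names.stream()
                                     .map(String::toUpperCase)
                                     .collect(Collectors.toList());

            System.out.println(unsafe + " / " + safe);
        }
    }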
Once you follow that rule, a sequential and a parallel execution of
the same pipeline on the same source should return the same result
(modulo some explicit non-determinism, such as the ordering of
side effects with forEach and the behavior of explicitly
non-deterministic operations like findAny).
Returning to your pipeline, the operations in the map-sum version
qualify under this rule, so it is safe to do it in parallel. So, should
you?
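For reference, here is a sketch of what such a map-sum pipeline might
look like; the Employee class and getSalary() are assumptions for the
sake of the example, not from the original messages.

    import java.util.Arrays;
    import java.util.List;

    public class MapSumSketch {
        static class Employee {
            private final int salary;
            Employee(int salary) { this.salary = salary; }
            int getSalary() { return salary; }
        }

        public static void main(String[] args) {
            List<Employee> staff = Arrays.asList(
                new Employee(100), new Employee(200), new Employee(300));

            // Sequential map-sum.
            int seqTotal = staff.stream().mapToInt(Employee::getSalary).sum();

            // The same pipeline in parallel. The lambdas are stateless, so
            // the result is the same; only the performance can differ.
            int parTotal = staff.parallelStream().mapToInt(Employee::getSalary).sum();

            System.out.println(seqTotal + " == " + parTotal);
        }
    }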
Bear in mind that the *only* value of parallelism is that it gets you an
answer faster; everything else about parallelism is a loss. Parallel
computations are less deterministic than sequential ones, and *always*
involve more total work than a sequential computation. So the only
reason to put up with parallelism is that it gives us the answer faster.
But it does not always give us the answer faster, and it's not even
always easy to predict whether it will.
The key parameters in the cost model are N (the size of the problem), Q
(the cost of one iteration), and K (the number of cores available). If K=1,
obviously we're not going to win, so the best move is not to try. If N
is really small, we're unlikely to win (unless Q is huge) because the
overhead of setting up and forking tasks will dwarf the cost of just
doing it sequentially.
Let's assume that N/Q/K are related such that eventually parallelism
yields a win. Typically, for low N, it is a loss, and then above some
breakeven, it starts to win, with the win increasing as N increases.
Depending on the problem, the loss might be as modest as a few percent,
or as bad as an order of magnitude or more.
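One rough way to get a feel for the breakeven on your own hardware is to
time the same low-Q pipeline sequentially and in parallel at a few sizes
of N. This is only a ballpark sketch, not a real benchmark (naive
System.nanoTime timing with no warmup; a harness like JMH gives far more
trustworthy numbers):

    import java.util.stream.LongStream;

    public class BreakevenSketch {
        public static void main(String[] args) {
            for (long n : new long[] {1_000L, 100_000L, 10_000_000L}) {
                long t0 = System.nanoTime();
                long seq = LongStream.range(0, n).map(i -> i * 2).sum();
                long t1 = System.nanoTime();
                long par = LongStream.range(0, n).parallel().map(i -> i * 2).sum();
                long t2 = System.nanoTime();
                // Same answer either way; only the elapsed time differs.
                System.out.printf("N=%d  seq=%dus  par=%dus  (sums %d/%d)%n",
                                  n, (t1 - t0) / 1_000, (t2 - t1) / 1_000, seq, par);
            }
        }
    }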
Returning to your pipeline, this is a low-Q problem; you're calling a
getter and adding some numbers. Breakevens for such problems in
well-tuned systems might be a few hundred or a few thousand (not two,
and not millions). Only you know how big the data set is and how many
cores you have, so I can't give a one-size-fits-all answer. But I can
tell you that for reasonably sized data sets, your calculation will
parallelize quite well. (BTW, the library fuses the map+reduce
operations together into a single parallel pass.)
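In other words, the map stage never materializes an intermediate
collection; map and sum run as one traversal of the data. Reusing the
hypothetical Employee and staff names from the sketch above:

    // Lazy map stage fused with the terminal sum; one pass over the data.
    int total = staff.parallelStream()
                     .mapToInt(Employee::getSalary)
                     .sum();

    // Spelled out as an explicit (boxed) reduce; still a single pass.
    int total2 = staff.parallelStream()
                      .map(Employee::getSalary)
                      .reduce(0, Integer::sum);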
A key variable in the cost model is the ratio of a compute step to a
merge step. The higher the merge cost, the higher the N breakeven.
Returning to your pipeline, your compute step and merge step are the
same computation -- primitive addition. This is good; the merge is
almost trivial. By contrast, in a case like groupBy to a Map, the merge
step -- merging two maps by key -- is very expensive. That's not so
good; it makes us prefer more sequential computation and less parallelism.
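A sketch of that contrast (again with made-up Employee/getDept names,
not taken from the thread):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class MergeCostSketch {
        static class Employee {
            final String dept;
            final int salary;
            Employee(String dept, int salary) { this.dept = dept; this.salary = salary; }
            String getDept() { return dept; }
            int getSalary() { return salary; }
        }

        public static void main(String[] args) {
            List<Employee> staff = Arrays.asList(
                new Employee("eng", 100), new Employee("eng", 200), new Employee("sales", 300));

            // Cheap merge: combining two partial results is one primitive addition.
            int payroll = staff.parallelStream().mapToInt(Employee::getSalary).sum();

            // Expensive merge: combining two partial results means merging two
            // maps key by key, which raises the breakeven N considerably.
            Map<String, List<Employee>> byDept =
                staff.parallelStream().collect(Collectors.groupingBy(Employee::getDept));

            System.out.println(payroll + " " + byDept.keySet());
        }
    }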
Clear as mud?
On 4/9/2013 6:51 PM, Barry Burd wrote:
> Thank you, Brian. Is there any reason not to use parallelStream here
> (other than that the example is too small to warrant parallelism)? I
> assume that a function like sum() doesn't do its thing until all the
> various parallel values have arrived.
> >>> On 4/9/2013 at 5:01 PM, in message <516481A2.8000702 at oracle.com>,
> Brian Goetz <brian.goetz at oracle.com> wrote:
> The recommendation is to format chains like:
> ... Etc.