Concerns about parallel streams

Thu Jul 11 20:35:46 PDT 2013

Sam,

On 12/07/2013 5:20 AM, Sam Pullara wrote:
> As it stands, and it seems we are far past changing this API, it is simply too easy to get a parallel stream without thinking about whether it is the right thing to do. I think we need to extensively document when and why you would use parallel streams vs sequential streams. We should include a cost model, a benchmark that will help people figure out whether they should use it, and perhaps some rules of thumb where it makes sense. As it stands I think that we are going to see some huge regressions in performance (both memory and CPU usage) when people call .parallel() on streams that should be evaluated sequentially. It would have been great to have the cost model built into the system that would make a good guess as to whether it should use parallel execution.

I think we addressed this at the start with the decision to require 
explicit rather than automatic parallelism. Hence I totally oppose any 
proposal that we run in sequential mode until we have used up a 
timeslice - that's the automatic parallelism path.

Continuing on that explicit path, just as our libraries require explicit 
parallelism selection, so applications should also require/allow it. If 
an app chooses to always use parallel() then that is "automatic 
parallelism" at the app level - and that is as bad as auto-parallelism 
at the library level. Programmers don't have the runtime knowledge 
needed to determine whether parallelism will "work" - that is something 
that application deployers need to choose.

So my advice for the docs here is twofold:

a) programmers should stick with sequential unless parallel can be shown 
to have a significant benefit; and

b) where programmers do support parallelism, they should let 
deployers/end-users opt in to it, rather than enabling it automatically.
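
As a rough illustration of (b), here is a minimal sketch of a deployer 
opt-in. The class name OptInParallel and the system property 
"app.streams.parallel" are hypothetical names invented for this example; 
neither is part of the JDK or of any proposal in this thread. The code 
stays sequential unless whoever launches the app sets the property.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

final class OptInParallel {
    // Hypothetical property name chosen for this sketch; not a JDK setting.
    static final String PROP = "app.streams.parallel";

    // Sequential by default; parallel only if the deployer set the property to "true".
    static <T> Stream<T> stream(List<T> source) {
        return Boolean.getBoolean(PROP) ? source.parallelStream() : source.stream();
    }

    // Application code goes through stream() and never calls parallel() itself.
    static long sumOfSquares(List<Integer> values) {
        return stream(values).mapToLong(v -> (long) v * v).sum();
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4);
        System.out.println(sumOfSquares(values)); // prints 30 in either mode
    }
}
```

A deployer who has measured a benefit on their hardware would then 
launch with -Dapp.streams.parallel=true; everyone else gets the 
sequential path without doing anything.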

My 2c.

Cheers,
David
------

> Doug, what are your thoughts? How do you expect people to use it? I can imagine some heuristics that we could put in that might save us — maybe by having a hook that decides when to really do parallel execution that gets executed every N ms with some statistics...
>
> Sam
>