Concerns about parallel streams

Thu Jul 11 13:02:08 PDT 2013

One item on my documentation to-do list is notes on methods that have 
particularly bad or surprising parallel performance.  #1 on this list is 
limit(n) for large n when the stream is neither sized nor unordered.  
Another culprit is collecting to maps (since map merging is expensive).  
Others?
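
As a rough sketch of the two cases above (an illustration, not a benchmark), assuming the java.util.stream API as released in Java 8: the class and method names below are made up for this example, and the groupingByConcurrent variant is only one possible way to avoid per-task map merging, not something proposed in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical illustration class; none of these names come from the thread.
class ParallelPitfalls {

    // Ordered source whose size is unknown after filter(): not SIZED, still ORDERED.
    static Stream<Integer> evens() {
        return IntStream.range(0, 1_000_000).boxed().filter(i -> i % 2 == 0);
    }

    // limit(n) on an ordered, unsized parallel stream must return the FIRST n
    // elements in encounter order, which forces coordination between subtasks.
    static List<Integer> firstN(int n) {
        return evens().parallel().limit(n).collect(Collectors.toList());
    }

    // Dropping the ordering constraint lets limit(n) accept ANY n elements.
    static List<Integer> anyN(int n) {
        return evens().parallel().unordered().limit(n).collect(Collectors.toList());
    }

    // groupingBy in parallel: each subtask builds its own map, then maps are merged.
    static Map<Integer, Long> countMerging(int upTo) {
        return IntStream.range(0, upTo).boxed().parallel()
                .collect(Collectors.groupingBy(i -> i % 10, Collectors.counting()));
    }

    // groupingByConcurrent: all subtasks accumulate into one shared ConcurrentMap,
    // so no map-to-map merge step is needed (assumed alternative, see note above).
    static Map<Integer, Long> countConcurrent(int upTo) {
        return IntStream.range(0, upTo).boxed().parallel()
                .collect(Collectors.groupingByConcurrent(i -> i % 10, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(firstN(5));                  // first five evens, in order
        System.out.println(anyN(5).size());             // five elements, identity unspecified
        System.out.println(countMerging(1000).equals(countConcurrent(1000)));
    }
}
```

Both pairs of methods compute comparable results; whether the unordered or concurrent form is actually faster depends on the source, n, and the hardware, so it would need to be measured.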

On 7/11/2013 3:20 PM, Sam Pullara wrote:
> As it stands, and it seems we are far past changing this API, it is
> simply too easy to get a parallel stream without thinking about
> whether it is the right thing to do. I think we need to extensively
> document when and why you would use parallel streams vs sequential
> streams. We should include a cost model, a benchmark that will help
> people figure out whether they should use it, and perhaps some rules
> of thumb for where it makes sense. As it stands I think that we are
> going to see some huge regressions in performance (both memory and
> CPU usage) when people call .parallel() on streams that should be
> evaluated sequentially. It would have been great to have the cost
> model built into the system that would make a good guess as to
> whether it should use parallel execution.
>
> Doug, what are your thoughts? How do you expect people to use it? I
> can imagine some heuristics that we could put in that might save us,
> maybe by having a hook that decides when to really do parallel
> execution that gets executed every N ms with some statistics...
>
> Sam
>