Concerns about parallel streams

Thu Jul 11 13:12:41 PDT 2013

I share Sam's concerns, in particular the one about memory use, which may
not be immediately obvious.

Are there obvious places to warn about memory use?
On Jul 11, 2013 1:02 PM, "Brian Goetz" <brian.goetz at oracle.com> wrote:

> One thing on my list of things to doc is notes on methods that have
> particularly bad or surprising parallel performance.  #1 on this list is
> limit(n) for large n when the stream is neither sized nor unordered.  Other
> culprits are collecting to maps (since map merging is expensive).  Others?
>
> On 7/11/2013 3:20 PM, Sam Pullara wrote:
>
>> As it stands, and it seems we are far past changing this API, it is
>> simply too easy to get a parallel stream without thinking about
>> whether it is the right thing to do. I think we need to extensively
>> document when and why you would use parallel streams vs sequential
>> streams. We should include a cost model, a benchmark that will help
>> people figure out whether they should use it, and perhaps some rules
>> of thumb for where it makes sense. As it stands I think that we are
>> going to see some huge regressions in performance (both memory and
>> cpu usage) when people call .parallel() on streams that should be
>> evaluated sequentially. It would have been great to have the cost
>> model built into the system that would make a good guess as to
>> whether it should use parallel execution.
>>
>> Doug, what are your thoughts? How do you expect people to use it? I
>> can imagine some heuristics that we could put in that might save us —
>> maybe by having a hook that decides when to really do parallel
>> execution that gets executed every N ms with some statistics...
>>
>> Sam
>>
>>
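
For illustration, here is a minimal sketch of the two culprits Brian mentions, limit(n) on an ordered, unsized parallel stream and collecting to a map. It assumes the java.util.stream API as it later shipped in Java 8 (including unordered() and Collectors.groupingByConcurrent); the class and method names are hypothetical, and it demonstrates only the API shapes, not a benchmark or a recommendation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical class name; a sketch only, not a benchmark.
class ParallelStreamSketch {

    // filter() drops the SIZED property, so this stream is ordered but not sized.
    // A parallel limit(n) must then return exactly the FIRST n matches in
    // encounter order, which may force buffering and extra work across threads.
    static List<Integer> firstEvensOrdered(int upper, int n) {
        return IntStream.range(0, upper).parallel()
                .filter(i -> i % 2 == 0)
                .limit(n)
                .boxed()
                .collect(Collectors.toList());
    }

    // With unordered(), limit(n) may return ANY n matches, which relaxes the
    // constraint above. Which elements come back is not specified.
    static List<Integer> anyEvensUnordered(int upper, int n) {
        return IntStream.range(0, upper).parallel()
                .unordered()
                .filter(i -> i % 2 == 0)
                .limit(n)
                .boxed()
                .collect(Collectors.toList());
    }

    // groupingBy in parallel builds one map per task and merges them pairwise;
    // that merging is the cost referred to in the thread.
    static Map<Integer, Long> countByMerging(int upper) {
        return IntStream.range(0, upper).parallel().boxed()
                .collect(Collectors.groupingBy(i -> i % 10, Collectors.counting()));
    }

    // groupingByConcurrent accumulates into a single shared ConcurrentMap,
    // avoiding map merges at the price of contention on that map.
    static ConcurrentMap<Integer, Long> countConcurrently(int upper) {
        return IntStream.range(0, upper).parallel().boxed()
                .collect(Collectors.groupingByConcurrent(i -> i % 10, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(firstEvensOrdered(1_000_000, 5));
        System.out.println(anyEvensUnordered(1_000_000, 5).size());
        System.out.println(countByMerging(1_000).equals(countConcurrently(1_000)));
    }
}
```

Whether either variant is actually faster than the sequential form depends on the source, the element count, and the per-element cost, so it would need to be measured for the workload at hand.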