Concerns about parallel streams

Thu Jul 11 23:37:16 PDT 2013

On 12/07/2013 2:39 PM, Sam Pullara wrote:
> My point is that the programmer doesn't know. Whether to use parallel mode is only known at runtime, under the specific performance and load circumstances of that environment. Unless all of that is fully specified at compile time, how would you decide whether to use parallel or sequential? My belief is that you can have an intuition that it might be better in some circumstances, so you make it possible by requesting it, but only at runtime would the system actually run something in parallel, after it verifies that the conditions merit it. The P, C, N and Q are all probably variable at runtime in the vast majority of use cases, unless the programmer is designing a system to run only on a specific piece of hardware, by itself, with a known data size and predictable algorithm performance.

That is why I said the programmer has to be selective about what they
parallelize and then require the runtime operator to opt in to that.

If I'm writing an app I can identify potential operations that would 
benefit from parallelism. But as you say I can't know for sure that in 
the final deployment this will be a good thing. Hence the deployer makes 
that final choice.
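
As an illustration of that split, here is a minimal sketch, assuming the
opt-in is carried by a system property. The property name
"app.streams.parallel" and the OptInParallel helper are hypothetical and
are not part of the JDK or of any proposal in this thread. The programmer
marks the candidate operation by routing it through maybeParallel(); the
deployer decides by passing -Dapp.streams.parallel=true at launch.

```java
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class OptInParallel {
    // Hypothetical property name chosen for this sketch; not a JDK setting.
    static final String PROPERTY = "app.streams.parallel";

    // True only when the deployer has set -Dapp.streams.parallel=true.
    static boolean parallelEnabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    // The programmer marks a stream as a candidate for parallelism;
    // the deployer's setting decides which mode is actually used.
    static <T> Stream<T> maybeParallel(Stream<T> stream) {
        return parallelEnabled() ? stream.parallel() : stream.sequential();
    }

    // Example candidate operation: the result is the same in either mode.
    static long sumOfSquares(int n) {
        return maybeParallel(IntStream.rangeClosed(1, n).boxed())
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println("parallel enabled: " + parallelEnabled());
        System.out.println("sum of squares: " + sumOfSquares(1000));
    }
}
```

Without the property the pipeline runs sequentially, so the default stays
safe and parallelism remains an explicit choice at deployment time.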

David

> Sam
>
> On Jul 11, 2013, at 9:29 PM, David Holmes <david.holmes at oracle.com> wrote:
>
>> On 12/07/2013 2:26 PM, Sam Pullara wrote:
>>> On Jul 11, 2013, at 8:35 PM, David Holmes <david.holmes at oracle.com> wrote:
>>>> On 12/07/2013 5:20 AM, Sam Pullara wrote:
>>>>> As it stands, and it seems we are far past changing this API, it is simply too easy to get a parallel stream without thinking about whether it is the right thing to do. I think we need to extensively document when and why you would use parallel streams vs sequential streams. We should include a cost model, a benchmark that will help people figure out whether they should use it, and perhaps some rules of thumb for where it makes sense. As it stands I think that we are going to see some huge regressions in performance (both memory and CPU usage) when people call .parallel() on streams that should be evaluated sequentially. It would have been great to have a cost model built into the system that would make a good guess as to whether it should use parallel execution.
>>>>
>>>> I think we addressed this at the start with the decision to require
>>>> explicit rather than automatic parallelism. Hence I totally oppose any
>>>> proposal that we run in sequential mode until we have used up a
>>>> timeslice - that's the automatic parallelism path.
>>>
>>> You misunderstand me. I mean that if you explicitly ask for parallel mode, we should not actually use it until we verify that you haven't made a big error. I agree with you, except that I think we should protect programmers from making a big mistake when parallel mode is enabled and is unnecessary.
>>
>> I don't agree with treating programmers like children. If they ask for
>> parallel they get parallel. Who are we to try to second-guess whether
>> they know what they are asking for?
>>
>> David
>>
>>> Sam
>>>