Concerns about parallel streams
David Holmes
david.holmes at oracle.com
Thu Jul 11 23:58:51 PDT 2013
On 12/07/2013 4:49 PM, Sam Pullara wrote:
> I don't think the deployer knows either and certainly can't make that decision for each individual stream in the system. Without a way to inject heuristics into the decision making process that make the choice based on measurements, I think we should probably not recommend its use outside of edge cases where everything is known. Deployment time is not really the same as runtime since you will not entirely know P, C, N and Q at that point though you might have a better idea about a few of them.
I'm using deployment/runtime interchangeably, which is not completely
accurate. The point is that the programmer should only enable potential
parallelism, and someone down the line then has to choose whether to
actually use it at runtime. Is that practical? Not if there are many such
decision points - but I don't think realistic apps will have that many.
If they do, then the possible tuning permutations will make things
intractable anyway.
That said, to make the selection in the code you really would want a
stream(boolean parallel) method; otherwise it is going to be ugly.
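
As a rough illustration of what a stream(boolean parallel) style entry
point could look like, here is a minimal sketch. The Streams helper class
and the "app.streams.parallel" property name are hypothetical and exist
only for this example; StreamSupport.stream(Spliterator, boolean) and
Boolean.getBoolean(String) are existing JDK calls. The programmer marks
the pipeline as a candidate for parallelism, and the deployer opts in
with a system property.

```java
import java.util.Collection;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper, not part of the JDK.
final class Streams {
    // Hypothetical property name; a deployer would pass -Dapp.streams.parallel=true.
    static final String PARALLEL_PROPERTY = "app.streams.parallel";

    private Streams() {}

    // The selection is a plain boolean argument rather than a branch
    // between collection.stream() and collection.parallelStream().
    static <T> Stream<T> stream(Collection<T> source, boolean parallel) {
        return StreamSupport.stream(source.spliterator(), parallel);
    }

    // Candidate for parallelism: sequential unless the deployer opts in.
    static <T> Stream<T> stream(Collection<T> source) {
        return stream(source, Boolean.getBoolean(PARALLEL_PROPERTY));
    }
}
```

A call site would then read Streams.stream(orders).map(...), and it stays
sequential unless the property is set to "true" at launch. A single global
switch is the simplest possible form; one property per decision point
would be closer to what is described above.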
> Anyway, we don't have anything like this in the Javadocs. My concern at this point, considering the API is done, is to make sure that people understand how hard it is to use this feature correctly. Doug's earlier advice, paraphrased: "Millions of elements that each use millions of instructions to process, on an unshared system with an embarrassingly parallel pipeline, is an ideal place to try using it." I'm hoping that we can add the heuristics callback in JDK 9 at this point.
Yes, guidance is needed. This is a very sharp tool.
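
To make the kind of guidance under discussion concrete, here is a rough
sketch of a size-times-cost check. It is not an existing JDK facility and
not the heuristics callback Sam is asking for. It assumes N means the
number of elements, Q the estimated cost per element, and P the number of
available cores; that reading of the letters is an assumption, since the
thread does not define them here. The ParallelHeuristic class and the
MIN_TOTAL_WORK threshold are made up for illustration and would have to
be replaced by measured values. C, the load from other work sharing the
machine, is deliberately ignored because it cannot be observed from a
check this simple.

```java
import java.util.Collection;
import java.util.stream.Stream;

// Hypothetical illustration only; the threshold below is an assumption.
final class ParallelHeuristic {
    // Assumed minimum amount of total work (n * q) before parallelism is tried.
    static final long MIN_TOTAL_WORK = 1_000_000L;

    private ParallelHeuristic() {}

    // n: element count, q: estimated cost per element, p: available cores.
    static boolean worthTrying(long n, long q, int p) {
        if (p < 2 || n < 2 || q <= 0) {
            return false; // nothing to split, or no cores to split it across
        }
        // Saturate instead of overflowing when n * q exceeds Long.MAX_VALUE.
        long totalWork = (n > Long.MAX_VALUE / q) ? Long.MAX_VALUE : n * q;
        return totalWork >= MIN_TOTAL_WORK;
    }

    // Returns a parallel stream only when the rough estimate clears the threshold.
    static <T> Stream<T> stream(Collection<T> source, long costPerElement) {
        int cores = Runtime.getRuntime().availableProcessors();
        return worthTrying(source.size(), costPerElement, cores)
                ? source.parallelStream()
                : source.stream();
    }
}
```

Even a check like this only guards against the obvious mistakes, such as
a tiny collection or a single core. It says nothing about a shared system
or a pipeline that does not split well.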
David
> Sam
>
> On Jul 11, 2013, at 11:37 PM, David Holmes <david.holmes at oracle.com> wrote:
>
>> On 12/07/2013 2:39 PM, Sam Pullara wrote:
>>> My point is that the programmer doesn't know. Whether to use parallel mode is only known at runtime, under the specific performance and load circumstances of that environment. Unless all of that is fully specified at compile time, how would you decide whether to use parallel or sequential? My belief is that you can have an intuition that it might be better in some circumstances, so you make it possible by requesting it, but only at runtime would the system actually run something in parallel, after it verifies that the conditions merit it. The P, C, N and Q are all probably variable at runtime in the vast majority of use cases, unless the programmer is designing a system to run only on a specific piece of hardware, by itself, with a known data size and predictable algorithm performance.
>>
>> That is why I said the programmer has to be selective about what they
>> parallelize and then require the runtime operator to opt-in to that.
>>
>> If I'm writing an app I can identify potential operations that would
>> benefit from parallelism. But as you say I can't know for sure that in
>> the final deployment this will be a good thing. Hence the deployer makes
>> that final choice.
>>
>> David
>>
>>> Sam
>>>
>>> On Jul 11, 2013, at 9:29 PM, David Holmes <david.holmes at oracle.com> wrote:
>>>
>>>> On 12/07/2013 2:26 PM, Sam Pullara wrote:
>>>>> On Jul 11, 2013, at 8:35 PM, David Holmes <david.holmes at oracle.com> wrote:
>>>>>> On 12/07/2013 5:20 AM, Sam Pullara wrote:
>>>>>>> As it stands, and it seems we are far past changing this API, it is simply too easy to get a parallel stream without thinking about whether it is the right thing to do. I think we need to extensively document when and why you would use parallel streams vs sequential streams. We should include a cost model, a benchmark that will help people figure out whether they should use it, and perhaps some rules of thumb for where it makes sense. As it stands I think that we are going to see some huge regressions in performance (both memory and CPU usage) when people call .parallel() on streams that should be evaluated sequentially. It would have been great to have the cost model built into the system that would make a good guess as to whether it should use parallel execution.
>>>>>>
>>>>>> I think we addressed this at the start with the decision to require
>>>>>> explicit rather than automatic parallelism. Hence I totally oppose any
>>>>>> proposal that we run in sequential mode until we have used up a
>>>>>> timeslice - that's the automatic parallelism path.
>>>>>
>>>>> You misunderstand me. I mean that if you explicitly ask for parallel mode, we should not actually use it until we verify that you haven't made a big error. I agree with you, except that I think we should protect programmers from making a big mistake when parallel mode is enabled but unnecessary.
>>>>
>>>> I don't agree with treating programmers like children. If they ask for
>>>> parallel they get parallel. Who are we to try and second guess if they
>>>> know what they are asking for?
>>>>
>>>> David
>>>>
>>>>> Sam
>>>>>
More information about the lambda-libs-spec-observers mailing list