Performance regression with IntStream.parallel.sum?

Mon Oct 28 01:11:31 PDT 2013

Hi Paul,

I made it a poor-mans benchmark without caliper or other third party deps. The code just loops infinitly and prints the time that it takes to compute the sum with three different flavors:
https://gist.github.com/szarnekow/7193025 

The numbers look really spooky to me, e.g. the first lines in the console look like this:

sum took 506 ms
parallelSumJava8 took 245 ms
sequentialSumJava8 took 896 ms

while it becomes a lot slower after a few iterations and converges to something along these lines:

sum took 448 ms
parallelSumJava8 took 1010 ms
sequentialSumJava8 took 975 ms

All default memory settings, no tuning args.
Unfortunately I cannot run this with b92 right now.

Maybe that helps to analyze the regression.

Best regards,
Sebastian

On 27.10.2013, at 14:51, Paul Sandoz wrote:

> Hi Sebastian,
> 
> The code you have looks fine, but i am not familiar with Caliper. I presume the magnitude of those grey bars represents variance?
> 
> It's hard to know what might cause such a regression, especially when N is large (> 10^4) which should factor out the costs of going parallel, implying it could be something deeper in the stack. You have noted correctly that there should be no per-element boxing cost (just one box/unbox for the result)
> 
> I would need to run this myself to investigate. There is an outlier for SequentialSumJava8, 1000000 1.8.0-ea-b92 that makes me suspicious.
> 
> Paul.
> 
> On Oct 26, 2013, at 1:03 PM, Sebastian Zarnekow <Sebastian.Zarnekow at itemis.de> wrote:
> 
>> Hi,
>> 
>> I think I stumbled on some performance regression in b112. Please consider this simple code:
>> 
>> public class C {
>> 
>>   public int parallelSum(final int reps) {
>>       IntStream stream = Arrays.stream(array).parallel().map(e ->
>>               e * 5 * reps
>>       );
>>       int result += stream.sum();
>>       return result;
>>   }
>> 
>>   C() {
>>       this.length = 2;
>>       setUp();
>>       parallelSum(5);
>>   }
>> 
>>   private int[] array;
>>   private int length;
>> 
>>   protected void setUp() {
>>       array = new int[length];
>>       for(int i = 0; i < length; i++) {
>>           array[i] = 3 * i;
>>       }
>>   }
>> 
>>   public static void main(String[] args) {
>>       new C();
>>   }
>> }
>> 
>> which was extracted from some Caliper benchmark code. The benchmark results (code: https://gist.github.com/szarnekow/7168147) are here (produced with build 1.8.0-ea-b112):
>> 
>> https://microbenchmarks.appspot.com/runs/a0579577-2df9-4894-9b19-ac345f78b020#r:scenario.benchmarkSpec.methodName,scenario.benchmarkSpec.parameters.length
>> 
>> whereas the following results were produced with an earlier build (1.8.0-ea-b92)
>> 
>> https://microbenchmarks.appspot.com/runs/220a5db5-434c-43f3-9cd9-93ef869d51e8#r:scenario.benchmarkSpec.methodName,scenario.benchmarkSpec.parameters.length
>> 
>> Initially thought it can be tracked down to java.util.stream.IntPipeline#reduce(int, java.util.function.IntBinaryOperator) which ultimately causes boxing / unboxing in ReduceOps.makeInt.ReducingSink.get() but the benchmark results for a sequential stream contradicted that. Also it turned out that #get is only called once in the parallel scenario, too. So here I was on the wrong track. Is there some other information that I could provide to track this regression down? Should the code be written differently? 
>> 
>> Best regards,
>> Sebastian
>> 
>> 
> 
>