Performance regression with IntStream.parallel.sum?
Paul Sandoz
paul.sandoz at oracle.com
Mon Oct 28 05:14:30 PDT 2013
Hi,
See the following maven project for a simple jmh test:
http://cr.openjdk.java.net/~psandoz/lambda/sum-test.zip
mvn compile
mvn exec:exec -Dexec.executable="java" -Dexec.args="-cp %classpath lambda.StreamSumTest"
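For reference, the three benchmark methods in that project presumably look roughly like the sketch below (written against the current JMH annotations; the actual sources in the zip may differ in detail):

import java.util.Arrays;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class StreamSumTest {

    @Param({"1000", "10000", "100000", "1000000", "10000000"})
    int N;

    int[] array;

    @Setup
    public void setUp() {
        array = new int[N];
        for (int i = 0; i < N; i++) {
            array[i] = i;
        }
    }

    @Benchmark
    public int testSeq() {             // plain for-loop baseline
        int sum = 0;
        for (int e : array) {
            sum += e;
        }
        return sum;
    }

    @Benchmark
    public int testStreamSeq() {       // sequential IntStream
        return Arrays.stream(array).sum();
    }

    @Benchmark
    public int testStreamPar() {       // parallel IntStream
        return Arrays.stream(array).parallel().sum();
    }
}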
If you are running this against tl (or perhaps a more recent jdk8 ea build; I am not sure when it was introduced) you might hit a HotSpot bug [1].
Here are the results running against tl on my macbook pro:
(all benchmarks are lambda.StreamSumTest.*; times in ns)

        N         testSeq     testStreamPar    testStreamSeq
     1000          449.57          10768.05          5116.70
    10000         4369.08          26125.72         50704.62
   100000        44746.85          60928.11        504971.77
  1000000       446940.01        1456732.71        463216.58
 10000000      4599356.27       13541336.48       4781332.56
All looks OK up to N=100_000, but then the parallel performance goes south for N=1_000_000, and the sequential stream measurement for N=1_000_000 is actually faster than the one for N=100_000. So something ain't quite right.
Here is a run that reproduces more closely what you observe:
(times in ns, same benchmarks as above)

        N         testSeq     testStreamPar    testStreamSeq
   100000        44622.00          81310.29        504237.91
   150000        66292.57         223427.03        745925.67
   200000        88635.57         301486.68        119910.05
   250000       110282.65         372250.76        116472.69
   300000       132324.48         441951.61        180784.40
   350000       154606.26         506304.43        163264.99
   400000       176641.57         597616.15        186684.42
   450000       198831.18         643014.78        210266.43
And another showing min/max/std:
N = 100000, testSeq = [43702.98, 46953.27] ns std = 846.48, testStreamPar = [36440.52, 186045.96] ns std = 52139.57, testStreamSeq = [46368.35, 519820.51] ns std = 198749.40
N = 150000, testSeq = [65730.82, 69287.38] ns std = 985.24, testStreamPar = [46112.90, 283185.92] ns std = 86288.19, testStreamSeq = [739919.71, 798913.39] ns std = 14951.13
N = 200000, testSeq = [87389.80, 93263.40] ns std = 1277.36, testStreamPar = [226219.24, 330973.86] ns std = 33158.44, testStreamSeq = [90188.22, 771664.12] ns std = 290565.98
N = 250000, testSeq = [109214.90, 115433.79] ns std = 1943.37, testStreamPar = [362114.70, 462211.98] ns std = 27934.58, testStreamSeq = [114811.58, 120525.06] ns std = 1707.77
N = 300000, testSeq = [131236.36, 140235.78] ns std = 2188.25, testStreamPar = [425894.96, 509646.46] ns std = 15809.52, testStreamSeq = [137893.59, 1555769.23] ns std = 597755.24
N = 350000, testSeq = [153025.72, 160845.79] ns std = 2439.52, testStreamPar = [496725.49, 593134.50] ns std = 25689.47, testStreamSeq = [160996.82, 169546.53] ns std = 2461.74
N = 400000, testSeq = [174403.45, 185214.29] ns std = 3309.23, testStreamPar = [350224.91, 704840.28] ns std = 80684.56, testStreamSeq = [183756.81, 200096.84] ns std = 4245.38
N = 450000, testSeq = [196297.09, 208527.95] ns std = 3200.05, testStreamPar = [632362.50, 780661.54] ns std = 27159.28, testStreamSeq = [203498.99, 2386279.07] ns std = 1016306.16
I dunno what might be causing this. I am wondering if the F/J common pool is unhappy. Plus I need to double-check the jmh options with someone more expert than I am.
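A quick way to sanity-check the common pool on the machine running the benchmark (just a diagnostic sketch, not part of the test project):

import java.util.concurrent.ForkJoinPool;

public class PoolCheck {
    public static void main(String[] args) {
        ForkJoinPool common = ForkJoinPool.commonPool();
        System.out.println("parallelism = " + common.getParallelism());
        System.out.println("pool size   = " + common.getPoolSize());
        System.out.println("active      = " + common.getActiveThreadCount());
        // The parallelism can also be pinned on the command line, e.g.
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=4
    }
}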
Paul.
[1] https://bugs.openjdk.java.net/browse/JDK-8026735
On Oct 28, 2013, at 10:16 AM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
> Hi Sebastian,
>
> Part of the problem here is that we may not be measuring what we think we are measuring, because of the dynamic nature of the JVM. This can make things spooky. I was able to mess around with your latest code and get a parallel speed-up for certain runs.
>
> I am gonna try with jmh [1] and report back.
>
> Paul.
>
> [1] http://openjdk.java.net/projects/code-tools/jmh/
>
> On Oct 28, 2013, at 9:11 AM, Sebastian Zarnekow <Sebastian.Zarnekow at itemis.de> wrote:
>
>> Hi Paul,
>>
>> I made it a poor man's benchmark without Caliper or other third-party deps. The code just loops infinitely and prints the time it takes to compute the sum with three different flavors:
>> https://gist.github.com/szarnekow/7193025
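>>
>> Roughly, the three flavors being timed are along these lines (a minimal sketch, not the exact gist code; the class name and array contents here are placeholders):
>>
>> import java.util.Arrays;
>> import java.util.function.IntSupplier;
>>
>> public class PoorMansSumBenchmark {
>>     static int[] data = new int[10_000_000];
>>
>>     static int sum() {                   // plain for-loop
>>         int s = 0;
>>         for (int e : data) s += e;
>>         return s;
>>     }
>>
>>     static int sequentialSumJava8() {    // sequential IntStream
>>         return Arrays.stream(data).sum();
>>     }
>>
>>     static int parallelSumJava8() {      // parallel IntStream
>>         return Arrays.stream(data).parallel().sum();
>>     }
>>
>>     static void time(String name, IntSupplier body) {
>>         long t0 = System.nanoTime();
>>         body.getAsInt();
>>         System.out.println(name + " took " + (System.nanoTime() - t0) / 1_000_000 + " ms");
>>     }
>>
>>     public static void main(String[] args) {
>>         while (true) {                   // loop forever, printing elapsed time per flavor
>>             time("sum", PoorMansSumBenchmark::sum);
>>             time("parallelSumJava8", PoorMansSumBenchmark::parallelSumJava8);
>>             time("sequentialSumJava8", PoorMansSumBenchmark::sequentialSumJava8);
>>         }
>>     }
>> }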
>>
>> The numbers look really spooky to me, e.g. the first lines in the console look like this:
>>
>> sum took 506 ms
>> parallelSumJava8 took 245 ms
>> sequentialSumJava8 took 896 ms
>>
>> while it becomes a lot slower after a few iterations and converges to something along these lines:
>>
>> sum took 448 ms
>> parallelSumJava8 took 1010 ms
>> sequentialSumJava8 took 975 ms
>>
>> All default memory settings, no tuning args.
>> Unfortunately I cannot run this with b92 right now.
>>
>> Maybe that helps to analyze the regression.
>>
>> Best regards,
>> Sebastian
>>
>> On 27.10.2013, at 14:51, Paul Sandoz wrote:
>>
>>> Hi Sebastian,
>>>
>>> The code you have looks fine, but I am not familiar with Caliper. I presume the magnitude of those grey bars represents variance?
>>>
>>> It's hard to know what might cause such a regression, especially when N is large (> 10^4), which should factor out the costs of going parallel, implying it could be something deeper in the stack. You have noted correctly that there should be no per-element boxing cost (just one box/unbox for the result).
>>>
>>> I would need to run this myself to investigate. There is an outlier for SequentialSumJava8 at N=1000000 on 1.8.0-ea-b92 that makes me suspicious.
>>>
>>> Paul.
>>>
>>> On Oct 26, 2013, at 1:03 PM, Sebastian Zarnekow <Sebastian.Zarnekow at itemis.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think I stumbled on some performance regression in b112. Please consider this simple code:
>>>>
>>>> import java.util.Arrays;
>>>> import java.util.stream.IntStream;
>>>>
>>>> public class C {
>>>>
>>>>     public int parallelSum(final int reps) {
>>>>         IntStream stream = Arrays.stream(array).parallel().map(e ->
>>>>             e * 5 * reps
>>>>         );
>>>>         int result = stream.sum();
>>>>         return result;
>>>>     }
>>>>
>>>>     C() {
>>>>         this.length = 2;
>>>>         setUp();
>>>>         parallelSum(5);
>>>>     }
>>>>
>>>>     private int[] array;
>>>>     private int length;
>>>>
>>>>     protected void setUp() {
>>>>         array = new int[length];
>>>>         for (int i = 0; i < length; i++) {
>>>>             array[i] = 3 * i;
>>>>         }
>>>>     }
>>>>
>>>>     public static void main(String[] args) {
>>>>         new C();
>>>>     }
>>>> }
>>>>
>>>> which was extracted from some Caliper benchmark code. The benchmark results (code: https://gist.github.com/szarnekow/7168147) are here (produced with build 1.8.0-ea-b112):
>>>>
>>>> https://microbenchmarks.appspot.com/runs/a0579577-2df9-4894-9b19-ac345f78b020#r:scenario.benchmarkSpec.methodName,scenario.benchmarkSpec.parameters.length
>>>>
>>>> whereas the following results were produced with an earlier build (1.8.0-ea-b92):
>>>>
>>>> https://microbenchmarks.appspot.com/runs/220a5db5-434c-43f3-9cd9-93ef869d51e8#r:scenario.benchmarkSpec.methodName,scenario.benchmarkSpec.parameters.length
>>>>
>>>> Initially I thought it could be tracked down to java.util.stream.IntPipeline#reduce(int, java.util.function.IntBinaryOperator), which ultimately causes boxing/unboxing in ReduceOps.makeInt.ReducingSink.get(), but the benchmark results for a sequential stream contradicted that. It also turned out that #get is only called once in the parallel scenario, too. So here I was on the wrong track. Is there some other information that I could provide to track this regression down? Should the code be written differently?
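>>>>
>>>> For reference, the sum() in the pipeline above is effectively a primitive reduce along these lines (a sketch, assuming IntPipeline delegates sum() to reduce; array and reps as in the snippet above):
>>>>
>>>> int result = Arrays.stream(array)
>>>>         .parallel()
>>>>         .map(e -> e * 5 * reps)
>>>>         .reduce(0, Integer::sum);   // per-element accumulation stays primitive;
>>>>                                     // only the final value from ReducingSink.get() is boxed once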
>>>>
>>>> Best regards,
>>>> Sebastian
>>>>
>>>>
>>>
>>>
>>
>
>