Performance regression with IntStream.parallel.sum?

Mon Oct 28 10:01:40 PDT 2013

On Oct 28, 2013, at 5:15 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
> 
>> 
>> Hmmm. Quite strange. Have to evaluate it.
>> 
> 
> Doh <thump> head hits desk. I forgot that vm flags were not propagated via the options builder to the forked java process:
> 
>                .jvmArgs("-XX:-TieredCompilation -Dbenchmark.n=" + n)
> 
> grrr... sorry for the noise. Re-running...
> 

N = 100_000
Benchmark                         Mode Thr    Cnt  Sec         Mean   Mean error    Units
l.StreamSumTest.testStreamPar     avgt   1    100    1       39.105        0.317    us/op
l.StreamSumTest.testStreamSeq     avgt   1    100    1      486.373        1.516    us/op

N = 1_000_000
Benchmark                         Mode Thr    Cnt  Sec         Mean   Mean error    Units
l.StreamSumTest.testStreamPar     avgt   1    100    1      174.094        8.515    us/op
l.StreamSumTest.testStreamSeq     avgt   1    100    1     4877.512       18.542    us/op

Now I am suspicious of the sequential numbers :-) While I would like to believe them, my laptop has only eight hardware threads, so speed-ups of roughly 12x and 28x are highly suspicious.
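For reference, the arithmetic behind those ratios can be sketched as below. The class and method names are made up for illustration; the inputs are the mean times from the two tables above and the eight hardware threads mentioned. It comes out at roughly 12.4x and 28.0x, both well beyond what eight threads could explain with linear scaling.

```java
public class SpeedupCheck {
    // Ratio of sequential to parallel mean time (both in us/op).
    static double speedup(double seqMicros, double parMicros) {
        return seqMicros / parMicros;
    }

    public static void main(String[] args) {
        int hardwareThreads = 8; // as stated for the laptop in this thread
        // {sequential mean, parallel mean} for N = 100_000 and N = 1_000_000
        double[][] runs = { {486.373, 39.105}, {4877.512, 174.094} };
        for (double[] r : runs) {
            double s = speedup(r[0], r[1]);
            // A speed-up above the hardware thread count suggests the
            // sequential baseline is slower than it should be.
            System.out.printf("speedup %.1fx, within thread count: %b%n",
                    s, s <= hardwareThreads);
        }
    }
}
```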

When looking at the sequential iterations (see below) I notice a slowdown that kicks in after a number of iterations (perhaps proportional to N), and I observed the same effect with your test program, the benchmark results for which are:

java -XX:-TieredCompilation -jar target/microbenchmarks.jar -i 10 -f 2

Benchmark                             Mode Thr    Cnt  Sec         Mean   Mean error    Units
o.m.s.IntStreamSum100K.parallel       avgt   1     20    1       40.469        1.517    us/op
o.m.s.IntStreamSum100K.sequential     avgt   1     20    1      477.382        4.407    us/op
o.m.s.IntStreamSum1M.parallel         avgt   1     20    1      150.988        1.855    us/op
o.m.s.IntStreamSum1M.sequential       avgt   1     20    1     4124.819      392.108    us/op
o.m.s.IntStreamSum5M.parallel         avgt   1     20    1      866.846        3.700    us/op
o.m.s.IntStreamSum5M.sequential       avgt   1     20    1    12629.711     5837.182    us/op
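
A rough way to watch for that kind of mid-run slowdown outside JMH is to time fixed-size batches of the sequential sum and print each batch's average. This is only a sketch under assumptions: the benchmark source is not shown in this thread, so the workload here (IntStream.range(0, n).sum()) is a guess based on the subject line, and the class name, batch size and iteration count are arbitrary. It is no substitute for the JMH numbers.

```java
import java.util.stream.IntStream;

public class SeqSumTiming {
    // Assumed workload: sequential sum of 0..n-1.
    // IntStream.sum() returns int, so the result wraps around for large n.
    static int seqSum(int n) {
        return IntStream.range(0, n).sum();
    }

    static int sink; // keeps the results live so the loop is not optimized away

    // Average time of one seqSum(n) call over reps calls, in microseconds.
    static double averageMicros(int n, int reps) {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += seqSum(n);
        }
        long elapsed = System.nanoTime() - start;
        return (elapsed / (double) reps) / 1000.0;
    }

    public static void main(String[] args) {
        int n = Integer.getInteger("benchmark.n", 1_000_000);
        for (int iter = 1; iter <= 30; iter++) {
            System.out.printf("Batch %3d: %.3f us/op%n", iter, averageMicros(n, 200));
        }
    }
}
```

Running it with -XX:-TieredCompilation and -Dbenchmark.n=... mirrors the flags used above; a sudden jump between consecutive batches would be the thing to look for.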

Paul.

N = 100_000
# Fork: 9 of 10
# Warmup: 20 iterations, 1000 ms each
# Measurement: 10 iterations, 1000 ms each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Running: lambda.StreamSumTest.testStreamSeq
# Warmup Iteration   1: 89.432 us/op
# Warmup Iteration   2: 484.337 us/op
# Warmup Iteration   3: 494.509 us/op
# Warmup Iteration   4: 483.470 us/op
# Warmup Iteration   5: 487.811 us/op
# Warmup Iteration   6: 485.572 us/op
# Warmup Iteration   7: 489.385 us/op
# Warmup Iteration   8: 488.314 us/op
# Warmup Iteration   9: 493.298 us/op
# Warmup Iteration  10: 497.965 us/op
# Warmup Iteration  11: 483.907 us/op
# Warmup Iteration  12: 494.186 us/op
# Warmup Iteration  13: 492.135 us/op
# Warmup Iteration  14: 486.906 us/op
# Warmup Iteration  15: 492.756 us/op
# Warmup Iteration  16: 494.186 us/op
# Warmup Iteration  17: 494.272 us/op
# Warmup Iteration  18: 493.907 us/op
# Warmup Iteration  19: 495.726 us/op
# Warmup Iteration  20: 495.143 us/op
Iteration   1: 489.998 us/op
Iteration   2: 494.910 us/op
Iteration   3: 496.420 us/op
Iteration   4: 490.313 us/op
Iteration   5: 493.948 us/op
Iteration   6: 498.616 us/op
Iteration   7: 498.998 us/op
Iteration   8: 496.266 us/op
Iteration   9: 488.312 us/op
Iteration  10: 497.052 us/op

N = 1_000_000
# Fork: 9 of 10
# Warmup: 20 iterations, 1000 ms each
# Measurement: 10 iterations, 1000 ms each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Running: lambda.StreamSumTest.testStreamSeq
# Warmup Iteration   1: 548.453 us/op
# Warmup Iteration   2: 475.805 us/op
# Warmup Iteration   3: 479.079 us/op
# Warmup Iteration   4: 481.045 us/op
# Warmup Iteration   5: 513.081 us/op
# Warmup Iteration   6: 4768.633 us/op
# Warmup Iteration   7: 4810.168 us/op
# Warmup Iteration   8: 4796.000 us/op
# Warmup Iteration   9: 4744.255 us/op
# Warmup Iteration  10: 4863.646 us/op
# Warmup Iteration  11: 4778.114 us/op
# Warmup Iteration  12: 4769.581 us/op
# Warmup Iteration  13: 4750.929 us/op
# Warmup Iteration  14: 4828.577 us/op
# Warmup Iteration  15: 4739.132 us/op
# Warmup Iteration  16: 4824.240 us/op
# Warmup Iteration  17: 4822.423 us/op
# Warmup Iteration  18: 4844.222 us/op
# Warmup Iteration  19: 4777.905 us/op
# Warmup Iteration  20: 4866.481 us/op
Iteration   1: 4832.221 us/op
Iteration   2: 4813.486 us/op
Iteration   3: 4907.794 us/op
Iteration   4: 4861.257 us/op
Iteration   5: 4815.668 us/op
Iteration   6: 4840.097 us/op
Iteration   7: 4861.160 us/op
Iteration   8: 5100.909 us/op
Iteration   9: 4862.112 us/op
Iteration  10: 4863.340 us/op