Performance regression with IntStream.parallel.sum?
Paul Sandoz
paul.sandoz at oracle.com
Wed Oct 30 07:33:11 PDT 2013
On Oct 29, 2013, at 12:15 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>
> On Oct 28, 2013, at 6:01 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>
>> On Oct 28, 2013, at 5:15 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>>>
>>>>
>>>> Hmmm. Quite strange. Have to evaluate it.
>>>>
>>>
>>> Doh <thump> head hits desk. I forgot that VM flags were not propagated via the options builder to the forked Java process:
>>>
>>> .jvmArgs("-XX:-TieredCompilation -Dbenchmark.n=" + n)
>>>
>>> grrr... sorry for the noise. Re-running...
>>>
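>>> For reference, a minimal sketch of passing the flags through the options builder so they reach the forked JVM (the main class, include pattern and default value of n below are illustrative, not the actual harness):
>>>
>>> import org.openjdk.jmh.runner.Runner;
>>> import org.openjdk.jmh.runner.RunnerException;
>>> import org.openjdk.jmh.runner.options.Options;
>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>
>>> public class StreamSumRunner {
>>>     public static void main(String[] args) throws RunnerException {
>>>         int n = 100_000;
>>>         Options opt = new OptionsBuilder()
>>>                 .include("StreamSumTest")
>>>                 // flags set on the parent JVM are not inherited by the
>>>                 // forked benchmark JVM, so pass them explicitly here
>>>                 .jvmArgs("-XX:-TieredCompilation", "-Dbenchmark.n=" + n)
>>>                 .build();
>>>         new Runner(opt).run();
>>>     }
>>> }
>>>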
>>
>> N = 100_000
>> Benchmark                      Mode Thr Cnt Sec     Mean Mean error Units
>> l.StreamSumTest.testStreamPar  avgt   1 100   1   39.105      0.317 us/op
>> l.StreamSumTest.testStreamSeq  avgt   1 100   1  486.373      1.516 us/op
>>
>>
>> N = 1_000_000
>> Benchmark                      Mode Thr Cnt Sec     Mean Mean error Units
>> l.StreamSumTest.testStreamPar  avgt   1 100   1  174.094      8.515 us/op
>> l.StreamSumTest.testStreamSeq  avgt   1 100   1 4877.512     18.542 us/op
>>
>>
>> Now I am suspicious of the sequential numbers :-) While I would like to believe them, my laptop has only eight hardware threads, so 12x and 28x speed-ups are highly suspicious.
>>
>> When looking at the sequential iterations (see below) I notice a slowdown that kicks in after a number of iterations.
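>>
>> For context, the benchmark being measured has roughly the following shape (a sketch using current JMH annotations; field names and sizes are representative rather than the exact source):
>>
>> import java.util.stream.IntStream;
>>
>> import org.openjdk.jmh.annotations.Benchmark;
>> import org.openjdk.jmh.annotations.Scope;
>> import org.openjdk.jmh.annotations.State;
>>
>> @State(Scope.Thread)
>> public class StreamSumTest {
>>     // N comes from the -Dbenchmark.n system property passed via jvmArgs
>>     static final int N = Integer.getInteger("benchmark.n", 100_000);
>>
>>     @Benchmark
>>     public int testStreamSeq() {
>>         // possible int overflow for large N is irrelevant to the timing
>>         return IntStream.range(0, N).sum();
>>     }
>>
>>     @Benchmark
>>     public int testStreamPar() {
>>         return IntStream.range(0, N).parallel().sum();
>>     }
>> }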
>
> On further investigation, the JIT compiler is kicking in on stream-construction-related methods at a later point, which has a negative effect on sequential evaluation. (The jmh "-prof hs_comp" and HotSpot -XX:+PrintCompilation options are very handy, in combination with a smaller sample time and increased iterations, to better observe when the jump occurs and correlate it with HotSpot activity; -XX:CompileThreshold was useful as well.)
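>
> A sketch of wiring those knobs up from the options builder, assuming the current JMH API (the class, include pattern and option values below are illustrative):
>
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.RunnerException;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
> import org.openjdk.jmh.runner.options.TimeValue;
>
> public class CompilationProbe {
>     public static void main(String[] args) throws RunnerException {
>         Options opt = new OptionsBuilder()
>                 .include("StreamSumTest")
>                 .measurementTime(TimeValue.milliseconds(200)) // smaller sample time ...
>                 .measurementIterations(500)                   // ... and more iterations
>                 .addProfiler("hs_comp")                       // i.e. "-prof hs_comp"
>                 .jvmArgsAppend("-XX:+PrintCompilation",
>                                "-XX:CompileThreshold=20000")  // example value only
>                 .build();
>         new Runner(opt).run();
>     }
> }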
>
> Using the following compiler options:
>
> -XX:-TieredCompilation -XX:CompileCommandFile=.hotspot_compiler
>
> $ cat .hotspot_compiler
> exclude java/util/stream/AbstractPipeline evaluate
>
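> As before, these flags only take effect if they reach the forked benchmark JVM; with a runner like the one sketched earlier that is just (include pattern assumed):
>
>     Options opt = new OptionsBuilder()
>             .include(".*Sum.*")
>             .jvmArgs("-XX:-TieredCompilation",
>                      "-XX:CompileCommandFile=.hotspot_compiler")
>             .build();
>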
> I now get this result:
>
> Benchmark                          Mode Thr Cnt Sec     Mean Mean error Units
> o.m.s.ForLoopSum100K.sequential    avgt   1  20   1   43.097      0.115 us/op
> o.m.s.IntStreamSum100K.parallel    avgt   1  20   1   40.090      0.892 us/op
> o.m.s.IntStreamSum100K.sequential  avgt   1  20   1   45.711      0.136 us/op
> o.m.s.IntStreamSum1M.parallel      avgt   1  20   1  153.193      3.281 us/op
> o.m.s.IntStreamSum1M.sequential    avgt   1  20   1  453.525      1.135 us/op
> o.m.s.IntStreamSum5M.parallel      avgt   1  20   1  863.744      6.092 us/op
> o.m.s.IntStreamSum5M.sequential    avgt   1  20   1 2354.732     11.270 us/op
>
> which is much more reasonable.
>
> Why did I choose to exclude AbstractPipeline.evaluate from compilation? There is a HotSpot-related bug associated with that method. Perhaps it is just coincidence, or just the "Age of Aquarius" :-) I have yet to try excluding other methods. However, it does suggest there might be some errant behaviour in the HotSpot compiler.
>
I think I found the cause. The problem is due to inlining limitations.
When the JIT compiles a method such as IntPipeline.reduce, it is very aggressive about inlining the methods it calls, the methods they call, and so on. Unfortunately the hottest piece of code deeper down in the stack (namely that of Spliterator.OfInt.forEachRemaining) gets only partially inlined and is effectively de-optimized, since the maximum inline level is reached.
If I increase the max inline level (e.g. -XX:MaxInlineLevel=11) then there is no measured slowdown.
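
For the sequential case the hot loop sits quite deep below the terminal operation; roughly, in the JDK sources:

    IntStream.sum()
      -> IntPipeline.reduce(0, Integer::sum)
        -> AbstractPipeline.evaluate(TerminalOp)
          -> ReduceOps.ReduceOp.evaluateSequential(...)
            -> AbstractPipeline.wrapAndCopyInto(...)
              -> AbstractPipeline.copyInto(...)
                -> Spliterator.OfInt.forEachRemaining(...)   <-- the hot loop

With the default inline depth (-XX:MaxInlineLevel=9 at the time) the JIT runs out of inlining budget before it reaches the loop at the bottom.
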
Paul.