Performance regression with IntStream.parallel.sum?
Paul Sandoz
paul.sandoz at oracle.com
Wed Oct 30 07:33:11 PDT 2013
On Oct 29, 2013, at 12:15 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>
> On Oct 28, 2013, at 6:01 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>
>> On Oct 28, 2013, at 5:15 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>>>
>>>>
>>>> Hmmm. Quite strange. Have to evaluate it.
>>>>
>>>
>>> Doh <thump> head hits desk. I forgot that VM flags were not propagated via the options builder to the forked Java process:
>>>
>>> .jvmArgs("-XX:-TieredCompilation -Dbenchmark.n=" + n)
>>>
>>> grrr... sorry for the noise. Re-running...
>>>
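>>> For reference, a minimal sketch of passing the flags through the options builder so they reach the forked JVM (the main class, include pattern and default value of n below are illustrative, not the actual harness):
>>>
>>> import org.openjdk.jmh.runner.Runner;
>>> import org.openjdk.jmh.runner.RunnerException;
>>> import org.openjdk.jmh.runner.options.Options;
>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>
>>> public class StreamSumRunner {
>>>     public static void main(String[] args) throws RunnerException {
>>>         int n = 100_000;
>>>         Options opt = new OptionsBuilder()
>>>                 .include("StreamSumTest")
>>>                 // flags set on the parent JVM are not inherited by the
>>>                 // forked benchmark JVM, so pass them explicitly here
>>>                 .jvmArgs("-XX:-TieredCompilation", "-Dbenchmark.n=" + n)
>>>                 .build();
>>>         new Runner(opt).run();
>>>     }
>>> }
>>>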
>>
>> N = 100_000
>> Benchmark                      Mode Thr Cnt Sec     Mean Mean error Units
>> l.StreamSumTest.testStreamPar  avgt   1 100   1   39.105      0.317 us/op
>> l.StreamSumTest.testStreamSeq  avgt   1 100   1  486.373      1.516 us/op
>>
>>
>> N = 1_000_000
>> Benchmark                      Mode Thr Cnt Sec     Mean Mean error Units
>> l.StreamSumTest.testStreamPar  avgt   1 100   1  174.094      8.515 us/op
>> l.StreamSumTest.testStreamSeq  avgt   1 100   1 4877.512     18.542 us/op
>>
>>
>> Now I am suspicious of the sequential numbers :-) While I would like to believe them, my laptop has only eight hardware threads, so 12x and 28x speed-ups are highly suspicious.
>>
>> When looking at the sequential iterations (see below) I notice a slowdown that kicks in after a number of iterations.
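>>
>> For context, the benchmark being measured has roughly the following shape (a sketch using current JMH annotations; field names and sizes are representative rather than the exact source):
>>
>> import java.util.stream.IntStream;
>>
>> import org.openjdk.jmh.annotations.Benchmark;
>> import org.openjdk.jmh.annotations.Scope;
>> import org.openjdk.jmh.annotations.State;
>>
>> @State(Scope.Thread)
>> public class StreamSumTest {
>>     // N comes from the -Dbenchmark.n system property passed via jvmArgs
>>     static final int N = Integer.getInteger("benchmark.n", 100_000);
>>
>>     @Benchmark
>>     public int testStreamSeq() {
>>         // possible int overflow for large N is irrelevant to the timing
>>         return IntStream.range(0, N).sum();
>>     }
>>
>>     @Benchmark
>>     public int testStreamPar() {
>>         return IntStream.range(0, N).parallel().sum();
>>     }
>> }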
>
> On further investigation, the JIT compiler is kicking in on stream-construction-related methods at a later point, which has a negative effect on sequential evaluation. (The jmh "-prof hs_comp" and HotSpot -XX:+PrintCompilation options are very handy, in combination with a smaller sample time and increased iterations, to better observe when the jump occurs and correlate it with HotSpot activity; -XX:CompileThreshold was useful as well.)
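>
> A sketch of wiring those knobs up from the options builder, assuming the current JMH API (the class, include pattern and option values below are illustrative):
>
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.RunnerException;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
> import org.openjdk.jmh.runner.options.TimeValue;
>
> public class CompilationProbe {
>     public static void main(String[] args) throws RunnerException {
>         Options opt = new OptionsBuilder()
>                 .include("StreamSumTest")
>                 .measurementTime(TimeValue.milliseconds(200)) // smaller sample time ...
>                 .measurementIterations(500)                   // ... and more iterations
>                 .addProfiler("hs_comp")                       // i.e. "-prof hs_comp"
>                 .jvmArgsAppend("-XX:+PrintCompilation",
>                                "-XX:CompileThreshold=20000")  // example value only
>                 .build();
>         new Runner(opt).run();
>     }
> }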
>
> Using the following compiler options:
>
> -XX:-TieredCompilation -XX:CompileCommandFile=.hotspot_compiler
>
> $ cat .hotspot_compiler
> exclude java/util/stream/AbstractPipeline evaluate
>
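> As before, these flags only take effect if they reach the forked benchmark JVM; with a runner like the one sketched earlier that is just (include pattern assumed):
>
>     Options opt = new OptionsBuilder()
>             .include(".*Sum.*")
>             .jvmArgs("-XX:-TieredCompilation",
>                      "-XX:CompileCommandFile=.hotspot_compiler")
>             .build();
>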
> I now get this result:
>
> Benchmark                          Mode Thr Cnt Sec     Mean Mean error Units
> o.m.s.ForLoopSum100K.sequential    avgt   1  20   1   43.097      0.115 us/op
> o.m.s.IntStreamSum100K.parallel    avgt   1  20   1   40.090      0.892 us/op
> o.m.s.IntStreamSum100K.sequential  avgt   1  20   1   45.711      0.136 us/op
> o.m.s.IntStreamSum1M.parallel      avgt   1  20   1  153.193      3.281 us/op
> o.m.s.IntStreamSum1M.sequential    avgt   1  20   1  453.525      1.135 us/op
> o.m.s.IntStreamSum5M.parallel      avgt   1  20   1  863.744      6.092 us/op
> o.m.s.IntStreamSum5M.sequential    avgt   1  20   1 2354.732     11.270 us/op
>
> which is much more reasonable.
>
> Why did I choose to exclude AbstractPipeline.evaluate from compilation? There is a HotSpot-related bug associated with that method. Perhaps it is just coincidence, or just the "Age of Aquarius" :-) I have yet to try excluding other methods. However, it does suggest there might be some errant behaviour in the HotSpot compiler.
>
I think I found the cause. The problem is due to inlining limitations.
When the JIT compiles a method such as IntPipeline.reduce, it is very aggressive about inlining the methods it calls, the methods they call, and so on. Unfortunately the hottest piece of code deeper down in the stack (namely that of Spliterator.OfInt.forEachRemaining) gets only partially inlined and is effectively de-optimized, since the maximum inline level is reached.
If I increase the max inline level (e.g. -XX:MaxInlineLevel=11) then there is no measured slowdown.
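
For the sequential case the hot loop sits quite deep below the terminal operation; roughly, in the JDK sources:

    IntStream.sum()
      -> IntPipeline.reduce(0, Integer::sum)
        -> AbstractPipeline.evaluate(TerminalOp)
          -> ReduceOps.ReduceOp.evaluateSequential(...)
            -> AbstractPipeline.wrapAndCopyInto(...)
              -> AbstractPipeline.copyInto(...)
                -> Spliterator.OfInt.forEachRemaining(...)   <-- the hot loop

With the default inline depth (-XX:MaxInlineLevel=9 at the time) the JIT runs out of inlining budget before it reaches the loop at the bottom.
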
Paul.