Odd decrease of benchmark throughput

Tue Sep 6 22:27:32 UTC 2016

I have seen this with my benchmarking experiments as well.
In my case I also observed that the behavior did not manifest with G1GC (identical benchmark, default MaxInlineLevel), which is still a mystery to me.
(For more detail on my experiments, see: http://zolyfarkas.github.io/spf4j/spf4j-benchmarks/CrazyJVM.pdf)

I researched this a little and found this from Aleksey (https://groups.google.com/forum/#!msg/mechanical-sympathy/m4opvy4xq3U/7lY8x8SvHgwJ):

"K. Inlining

The beast of the beasts: for many benchmarks, the performance differences
can only be explained by the inlining differences, which broke/enabled some
additional compiler optimizations. Hence, playing nice with the inliner is 
essential for benchmark harness. Again, pushing users to deal with this 
completely on their own is cruel, and we can ease their pain a bit.

JMH does two things: 1) It peels the hottest measurement loop in a separate
method, which provides the entry point for compilation, and the inlining 
budget starts there; 2) @CompilerControl annotation to control inlining
in some known places (@GMB and Blackhole methods are forcefully inlined these
days, for example)."

My guess is that using @CompilerControl on the @Benchmark method to disable its inlining might "alleviate" this…

But what the benchmark might be highlighting is that the actual code performs well only if MaxInlineLevel is raised above its default…
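
To see which inline level a given JVM is actually running with, something like the sketch below might help. It assumes a HotSpot JVM, where the com.sun.management.HotSpotDiagnosticMXBean is available; the class name InlineLevelCheck and the helper vmOption are made up for illustration and are not part of JMH or of the benchmark discussed in this thread.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, HotSpot-specific: prints the effective MaxInlineLevel.
public final class InlineLevelCheck {

    /** Returns the current value of a HotSpot VM flag as a string. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if the flag does not exist in this VM.
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("MaxInlineLevel = " + vmOption("MaxInlineLevel"));
    }
}
```

Running it once with no flags and once with -XX:MaxInlineLevel=20 should show whether the override actually reaches the JVM. Note that JMH forks its own JVMs, so the flag has to be passed to the forked process for it to affect the measurement.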

In any case I wish the JVM could be smarter here and not make things slower as it warms up… easier said than done :-)

—Z

> On Sep 6, 2016, at 11:12 AM, Dávid Karnok <akarnokd at gmail.com> wrote:
> 
> I think I found the problem. The JITWatch's own analysis indicated (should
> have looked at that earlier) that two of the hottest methods couldn't be
> inlined in L4 because of being too deep in the call stack. Adding
> -XX:MaxInlineLevel=20 the perf was ~650 ops/s for any number of measure
> iterations.
> 
> So it seems that until the @Benchmark method got inlined, everything was
> relatively fine but then once that outermost method became eligible for
> JIT-ting, the hot path fell below the default inline level and the
> resulting code was now 3x slower.
> 
> Thank you for your time.
> 
> 
> 2016-09-06 16:18 GMT+02:00 Dávid Karnok <akarnokd at gmail.com>:
> 
>> Thank you for the answer. I guess I'd need xperf for Windows but that tool
>> is Win 8+. I'll try my luck with JITWatch again to see the difference in C1
>> and C2 assemblies.
>> 
>> 2016-09-06 16:00 GMT+02:00 Aleksey Shipilev <ashipile at redhat.com>:
>> 
>>> On 09/06/2016 01:05 PM, Dávid Karnok wrote:
>>>> # Run progress: 16,67% complete, ETA 00:01:47
>>>> # Fork: 1 of 1
>>>> # Warmup Iteration   1: 622,250 ops/s
>>>> # Warmup Iteration   2: 646,154 ops/s
>>>> # Warmup Iteration   3: 637,035 ops/s
>>>> # Warmup Iteration   4: 639,014 ops/s
>>>> # Warmup Iteration   5: 645,212 ops/s
>>>> Iteration   1: 648,120 ops/s
>>>> Iteration   2: 647,042 ops/s
>>>> Iteration   3: 650,176 ops/s
>>>> Iteration   4: 335,979 ops/s
>>>> Iteration   5: 195,415 ops/s
>>>> 
>>>> (Running Windows 7 x64, Java 8u102, i7 4790)
>>>> 
>>>> Please advise.
>>> 
>>> We have seen the behavior like that before.
>>> 
>>> The way to further diagnose this: prepare two runs where measurement
>>> phase a) has only 650 ops/s iterations; b) has only 195 ops/s iterations
>>> -- vary warmup/measurement durations to fit. After that, -prof perfasm
>>> both runs and see where the difference in profiles is. perfasm takes
>>> only the measurement phase in the consideration.
>>> 
>>> 99% bet is on different compilation, and it is important to know what
>>> exactly compiled differently in 195 ops/sec iterations.
>>> 
>>> Thanks,
>>> -Aleksey
>>> 
>> 
>> 
>> 
>> --
>> Best regards,
>> David Karnok
>> 
> 
> 
> 
> -- 
> Best regards,
> David Karnok