RFR: 8276066: Reset LoopPercentProfileLimit for x86 due to suboptimal performance

Mon Nov 1 23:59:10 UTC 2021

On Wed, 27 Oct 2021 14:26:49 GMT, Jie Fu <jiefu at openjdk.org> wrote:

> Hi all,
> 
> I'd like to reset the value of `LoopPercentProfileLimit` (from 30 to the original 10) for x86 to fix performance degradation.
> 
> We observed that the same Java application runs slower on x86 than on aarch64.
> According to some SPEC benchmark results, however, x86 should not be that much slower than aarch64.
> 
> After some investigation, it seems that the slowness of x86 is caused by the different default settings of `LoopPercentProfileLimit` (30 for x86, but 10 for other platforms).
> If we change `LoopPercentProfileLimit` from 30 to 10, x86 would run faster.
> 
> `LoopPercentProfileLimit` [1] was first added in JDK-8149421, which set it to 30 for x86 and 10 for other platforms.
> Logically, the default value of `LoopPercentProfileLimit` was already 10 for all platforms before JDK-8149421.
> This is because when `LoopPercentProfileLimit=10`, the hard-coded `10.0` [2] equals `100.0 / LoopPercentProfileLimit` [3].
> So with `LoopPercentProfileLimit=10`, this unrolling rule [3] is the same as in the original design before JDK-8149421.
> 
> The most important point is that, since the very beginning of the OpenJDK source code, the default value of `LoopPercentProfileLimit` has (logically) been 10 for all platforms.
> So I suggest resetting `LoopPercentProfileLimit` to the original value (10) for x86, just like on other platforms.
> 
> I noted that the original review thread said JDK-8149421 would benefit some SPECjvm2008 benchmarks [4].
> So I ran SPECjvm2008 with `LoopPercentProfileLimit=10` and found no performance drop on x86.
> This change therefore does not undo the SPECjvm2008 gains from JDK-8149421.
> 
> To show the potential improvement of this change, I've added a JMH test to the patch.
> According to this microbenchmark, performance improves by 1.25x to 2.0x.
> 
> Any comments?
> 
> Thanks.
> Best regards,
> Jie
> 
>   
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
> [2] https://github.com/openjdk/jdk8u/blob/master/hotspot/src/share/vm/opto/loopTransform.cpp#L673
> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L903
> [4] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2016-February/021205.html
> 
> <img width="420" alt="ratio" src="https://user-images.githubusercontent.com/19923746/139084273-aa2e2eae-4a74-4fcb-8430-d2e2e49d5d5c.png">
> 
> <img width="615" alt="before" src="https://user-images.githubusercontent.com/19923746/139084508-793f1109-1ce3-4427-a2e3-660c91758a7c.png">
> 
> <img width="617" alt="after" src="https://user-images.githubusercontent.com/19923746/139084542-c8a8a705-d7ed-499f-9ceb-7175671c0e3b.png">

Hi all,

I tested this patch on our deep learning cluster with JDK 17 yesterday.
With the patch, the time of the computing stage is further reduced by 5% to 6%.
So it's really worth making this change.
We also plan to backport it if it is accepted into JDK mainline.

If post-loop vectorization requires a higher `LoopPercentProfileLimit` on x86, we can still re-tune it as part of that enhancement, just as for other platforms.
I also think it would be better to first improve the baseline performance with `LoopPercentProfileLimit=10`, and then evaluate performance optimizations such as post-loop vectorization on x86 against that baseline.
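
For reference, the arithmetic behind the equivalence noted in the quoted message can be sketched as below. This is only an illustration, not the HotSpot code: the class name `ResidualFactorSketch` and the helper `residualFactor` are hypothetical, and the sketch models nothing beyond the expression `100.0 / LoopPercentProfileLimit` from [3].

```java
public class ResidualFactorSketch {
    // Hypothetical helper: evaluates the expression `100.0 / LoopPercentProfileLimit`
    // that the quoted message compares against the old hard-coded `10.0`.
    static double residualFactor(int loopPercentProfileLimit) {
        return 100.0 / loopPercentProfileLimit;
    }

    public static void main(String[] args) {
        // With the proposed value (10), the factor is the old constant 10.0.
        System.out.println("limit=10 -> " + residualFactor(10));
        // With the current x86 default (30), the factor is roughly 3.33.
        System.out.println("limit=30 -> " + residualFactor(30));
    }
}
```

With a limit of 10 the factor is exactly the pre-JDK-8149421 constant, while a limit of 30 gives a factor about three times smaller, which is why the two defaults lead to different unrolling decisions.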

Thanks.
Best regards,
Jie

-------------

PR: https://git.openjdk.java.net/jdk/pull/6142