RFR(S): 8247307: C2: Loop array fill stub routines are not called
Pengfei Li
Pengfei.Li at arm.com
Mon Jun 15 03:20:53 UTC 2020
Hi,
Can I have a review of this C2 loop optimization fix?
JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
C2 has a loop optimization phase called intrinsify_fill. It matches the
pattern of a single array store of a loop-invariant value in a counted
loop, like the one below, and replaces it with a call to a stub routine.
for (int i = start; i < limit; i++) {
  a[i] = value;
}
Unfortunately, this doesn't work in the current JDK after loop strip
mining. Instead, the above loop is eventually unrolled and
auto-vectorized by subsequent optimization phases. The root cause is
that in strip-mined loops, the inner CountedLoopNode may be used by the
address polling node of the safepoint in the outer loop. But since
safepoint polling is unrelated to any real operation in the loop, it
should not hinder the pattern match. So in this patch, the polladr's use
is ignored in the match check.
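
For illustration, the fill pattern described above can be contrasted
with loops that fall outside it. This is only a sketch at the Java
source level: the class and method names are hypothetical, and the real
match in C2 is done on the compiler's internal graph and involves
further conditions not shown here.

```java
import java.util.Arrays;

// Hypothetical class and method names, for illustration only.
class FillPatterns {
    // A single store of a loop-invariant value in a counted loop:
    // the shape that intrinsify_fill is described as matching.
    static void fillRange(int[] a, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    // The stored value depends on the induction variable, so it is
    // not loop invariant and does not fit the described pattern.
    static void fillWithIndex(int[] a, int start, int limit) {
        for (int i = start; i < limit; i++) {
            a[i] = i;
        }
    }

    // Two array stores per iteration instead of a single one, so this
    // also falls outside the described pattern.
    static void fillTwo(int[] a, int[] b, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
            b[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[8];
        fillRange(a, 2, 6, 7);
        System.out.println(Arrays.toString(a)); // [0, 0, 7, 7, 7, 7, 0, 0]
    }
}
```

Whichever code C2 emits for fillRange, the observable result is the same
as Arrays.fill(a, start, limit, value); only the generated machine code
differs.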
We compared the performance of the array fill code between the
auto-vectorized version and the stub routine version. The JMH case for
the tests can be found at [1]. Results show that on x86, the stub code
is even slower than the auto-vectorized code. To prevent any regression,
the VM option OptimizeFill is turned off for x86 in this patch, so the
patch does not change the generated code on x86. On AArch64, the two
versions perform almost the same in the general case. But if the value
to be filled is zero, the stub code is much faster. This makes sense, as
the hand-crafted AArch64 assembly uses a cache maintenance instruction
(DC ZVA) to zero large blocks. Below are JMH scores on AArch64.
Before:
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
After:
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
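
To make the before/after comparison easier to read, the scores can be
turned into speedup factors. The snippet below is a small sketch that
only divides the mean scores reported above; the class and method names
are hypothetical, and the reported error margins are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch or the JMH test case.
class FillSpeedup {
    // Speedup factor: time before divided by time after.
    // Scores are in ns/op, so lower is better and a factor > 1 is a win.
    static double speedup(double beforeNs, double afterNs) {
        return beforeNs / afterNs;
    }

    public static void main(String[] args) {
        // {before, after} mean scores in ns/op, copied from the AArch64 results.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("fillByteArray", new double[] {2078.700, 2080.382});
        scores.put("fillIntArray", new double[] {12371.497, 11997.621});
        scores.put("fillShortArray", new double[] {4132.439, 4309.035});
        scores.put("zeroByteArray", new double[] {2080.313, 903.434});
        scores.put("zeroIntArray", new double[] {10961.331, 8141.533});
        scores.put("zeroShortArray", new double[] {4126.386, 1784.124});

        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%-15s %.2fx%n", e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

From these means, the zero-fill byte and short cases come out at roughly
2.3x, zeroIntArray at about 1.3x, and the non-zero fill cases within a
few percent of 1.0x. The Int results carry large error margins, so their
ratios should be read with care.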
Another advantage of using the stub routine is that the generated code
size is reduced.
Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1
were run and no new failures were found.
--
Thanks,
Pengfei