RFR(S): 8247307: C2: Loop array fill stub routines are not called
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Jun 15 18:41:40 UTC 2020
Hi,
It would be interesting to see the difference with strip mining off.
With strip mining, only part of the iterations is replaced with the stub call.
I don't see the referenced link [1] in the e-mail.
Are these performance data for AArch64?
Which x86 CPU did you test on? (AVX-512?)
Please add both x64 and AArch64 perf data to the RFE.
What array sizes did you test?
A few years ago OptimizedFill won over vectorized loops, but CPUs and vectorization have improved since then.
Maybe we can deprecate this code if it does not have performance benefits, or we should revisit the stub code for modern CPUs.
We need more data.
Thanks,
Vladimir
On 6/14/20 8:20 PM, Pengfei Li wrote:
> Hi,
>
> Can I have a review of this C2 loop optimization fix?
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>
> C2 has a loop optimization phase called intrinsify_fill. It matches the
> pattern of a single array store of a loop-invariant value in a counted loop,
> like the one below, and replaces it with a call to a stub routine.
>
> for (int i = start; i < limit; i++) {
>     a[i] = value;
> }
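>
> For illustration only, the stub call that replaces such a loop has roughly
> the same effect as
>
>     java.util.Arrays.fill(a, start, limit, value);
>
> implemented as hand-crafted assembly per CPU.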
>
> Unfortunately, this doesn't work in the current JDK once loop strip mining
> is applied: the above loop is eventually unrolled and auto-vectorized by
> subsequent optimization phases instead. The root cause is that in
> strip-mined loops, the inner CountedLoopNode may be used by the polling
> address node of the safepoint in the outer loop. But since the safepoint
> poll is unrelated to any real operation in the loop, it should not hinder
> the pattern match. So in this patch, the polladr's use is ignored in the
> match check.
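>
> For reference, the strip-mined shape looks roughly like the following
> pseudo-Java (a conceptual sketch only, not the actual C2 IR; stripLen is a
> placeholder for the strip length chosen by C2):
>
>     for (int i = start; i < limit; ) {              // outer strip-mined loop
>         int innerLimit = Math.min(limit, i + stripLen);
>         for (; i < innerLimit; i++) {               // inner CountedLoop, no safepoint poll
>             a[i] = value;
>         }
>         // safepoint poll here; its polling address node can use the inner CountedLoopNode
>     }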
>
> We did some performance comparison of the array fill code between the
> auto-vectorized version and the stub routine version. The JMH case for the
> tests can be found at [1] (a simplified sketch of such a benchmark is shown
> after the scores below). Results show that on x86, the stub code is even
> slower than the auto-vectorized code. To prevent any regression, the VM
> option OptimizedFill is turned off for x86 in this patch, so the patch does
> not affect the generated code on x86. On AArch64, the two versions show
> almost the same performance in general cases, but if the value to be filled
> is zero, the stub code performs much better. This makes sense as AArch64
> uses cache maintenance instructions (DC ZVA) to zero large blocks in the
> hand-crafted assembly. Below are JMH scores on AArch64.
>
> Before:
> Benchmark                     Mode  Cnt      Score     Error  Units
> TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
> TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
> TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
> TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
> TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
> TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
>
> After:
> Benchmark                     Mode  Cnt      Score     Error  Units
> TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
> TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
> TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
> TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
> TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
> TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
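>
> For reference, a simplified sketch of this kind of fill benchmark
> (illustrative only; names and array size are placeholders, not the exact
> code at [1]):
>
>     import java.util.concurrent.TimeUnit;
>     import org.openjdk.jmh.annotations.*;
>
>     @BenchmarkMode(Mode.AverageTime)
>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>     @State(Scope.Thread)
>     public class TestArrayFill {
>         static final int SIZE = 64 * 1024;      // placeholder array size
>         int[] intArray = new int[SIZE];
>
>         @Benchmark
>         public int[] fillIntArray() {           // non-zero loop-invariant fill
>             int[] a = intArray;
>             for (int i = 0; i < a.length; i++) {
>                 a[i] = 0x12345678;
>             }
>             return a;                           // return to avoid dead-code elimination
>         }
>
>         @Benchmark
>         public int[] zeroIntArray() {           // zero fill, where DC ZVA helps on AArch64
>             int[] a = intArray;
>             for (int i = 0; i < a.length; i++) {
>                 a[i] = 0;
>             }
>             return a;
>         }
>     }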
>
> Another advantage of using the stub routine is that the generated code
> size is reduced.
>
> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1 were
> run and no new failures were found.
>
> --
> Thanks,
> Pengfei
>