RFR(S): 8247307: C2: Loop array fill stub routines are not called

Vladimir Kozlov vladimir.kozlov at oracle.com
Mon Jun 15 18:41:40 UTC 2020


Hi,

It would be interesting to see the difference with strip mining off.
With strip mining, only part of the iterations is replaced with the stub.

I don't see the referenced link [1] in the e-mail.

Is this performance data for AArch64?

What x86 CPU did you test on? (AVX-512?)

Please add both x64 and AArch64 perf data to the RFE.

What sizes of arrays did you test?

A few years ago OptimizedFill won over vectorized loops, but CPUs and vectorization have improved since then. Maybe we can 
deprecate this code if it no longer shows a performance benefit. Or we should revisit the stub code for modern CPUs.

We need more data.

Thanks,
Vladimir

On 6/14/20 8:20 PM, Pengfei Li wrote:
> Hi,
> 
> Can I have a review of this C2 loop optimization fix?
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
> 
> C2 has a loop optimization phase called intrinsify_fill. It matches the
> pattern of a single array store of a loop-invariant value in a counted
> loop, like below, and replaces it with a call to a stub routine.
> 
>    for (int i = start; i < limit; i++) {
>      a[i] = value;
>    }
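For reference, the loop pattern quoted above is semantically equivalent to a `java.util.Arrays.fill` call over the same index range. The small sketch below (not part of the original patch or its tests; class and variable names are illustrative) shows the two forms producing identical results:

```java
import java.util.Arrays;

public class FillDemo {
    public static void main(String[] args) {
        int start = 2, limit = 8, value = 7;
        int[] a = new int[10];
        int[] b = new int[10];

        // The matched pattern: a single array store of a
        // loop-invariant value inside a counted loop.
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }

        // Semantically equivalent library call over the same range.
        Arrays.fill(b, start, limit, value);

        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```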
> 
> Unfortunately, this no longer works in the current JDK after loop strip
> mining was introduced. The above loop is instead unrolled and
> auto-vectorized by subsequent optimization phases. The root cause is
> that in strip-mined loops, the inner CountedLoopNode may be used by the
> address polling node of the safepoint in the outer loop. But since the
> safepoint poll is unrelated to any real operation in the loop, it
> should not hinder the pattern match. So in this patch, the polladr's
> use is ignored in the match check.
> 
> We have some performance comparison of the code for array fill between
> the auto-vectorized version and the stub routine version. The JMH case
> for the tests can be found at [1]. Results show that on x86, the stub
> code is even slower than the auto-vectorized code. To prevent any
> regression, the VM option OptimizedFill is turned off for x86 in this
> patch, so the patch doesn't change the generated code on x86. On
> AArch64, the two versions show almost the same performance in general
> cases. But if the value to be filled is zero, the stub code's
> performance is much better. This makes sense, as AArch64 uses cache
> maintenance instructions (DC ZVA) to zero large blocks in the
> hand-crafted assembly. Below are JMH scores on AArch64.
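The actual JMH source referenced as [1] is not included in the thread. As a rough, hand-rolled illustration of the shape of such a measurement (class name, array size, and iteration count here are assumptions, not the real TestArrayFill benchmark), one could write:

```java
public class FillBench {
    static final int SIZE = 64 * 1024;   // assumed array size
    static final int ITERS = 10_000;     // assumed iteration count

    // The fill loop under test: a single store of a loop-invariant value.
    static void fill(int[] a, int value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[SIZE];
        // Warm up so C2 compiles (and, with the patch, intrinsifies) the loop.
        for (int i = 0; i < ITERS; i++) fill(a, i);

        long t0 = System.nanoTime();
        for (int i = 0; i < ITERS; i++) fill(a, 42);
        long t1 = System.nanoTime();

        System.out.println("avg ns/op: " + (t1 - t0) / ITERS);
        System.out.println("a[0] = " + a[0]); // prints "a[0] = 42"
    }
}
```

A real comparison would use JMH (as the poster did) to control for warmup and dead-code elimination; this sketch only shows the loop shape being measured.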
> 
> Before:
>    Benchmark                     Mode  Cnt      Score     Error  Units
>    TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
>    TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
>    TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
>    TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
>    TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
>    TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
> 
> After:
>    Benchmark                     Mode  Cnt      Score     Error  Units
>    TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
>    TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
>    TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
>    TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
>    TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
>    TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
> 
> Another advantage of using the stub routine is that the generated code
> size is reduced.
> 
> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1
> were run, and no new failures were found.
> 
> --
> Thanks,
> Pengfei
> 


More information about the hotspot-compiler-dev mailing list