RFR(S): 8247307: C2: Loop array fill stub routines are not called

Pengfei Li Pengfei.Li at arm.com
Tue Jun 16 06:24:55 UTC 2020


Sorry I forgot to paste below JMH link in my last email.

[1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java

BTW. If I turn on OptimizeFill manually there's below performance regression on x86. So I turned it off on x86 in my patch to make things unchanged.

Before (x86 with -XX:+OptimizeFill)
  Benchmark                     Mode  Cnt     Score    Error  Units
  TestArrayFill.fillByteArray   avgt   25  1793.206 ± 15.337  ns/op
  TestArrayFill.fillIntArray    avgt   25  6679.491 ± 14.729  ns/op
  TestArrayFill.fillShortArray  avgt   25  3412.708 ± 12.005  ns/op
  TestArrayFill.zeroByteArray   avgt   25  1785.940 ± 15.174  ns/op
  TestArrayFill.zeroIntArray    avgt   25  6666.709 ± 11.735  ns/op
  TestArrayFill.zeroShortArray  avgt   25  3404.146 ± 23.045  ns/op

After (x86 with -XX:+OptimizeFill)
  Benchmark                     Mode  Cnt     Score     Error  Units
  TestArrayFill.fillByteArray   avgt   25  2281.374 ± 191.220  ns/op
  TestArrayFill.fillIntArray    avgt   25  9009.679 ± 901.541  ns/op
  TestArrayFill.fillShortArray  avgt   25  4828.686 ±  49.199  ns/op
  TestArrayFill.zeroByteArray   avgt   25  2463.745 ±  47.640  ns/op
  TestArrayFill.zeroIntArray    avgt   25  9062.682 ± 939.538  ns/op
  TestArrayFill.zeroShortArray  avgt   25  4837.231 ±  50.026  ns/op

> Hi,
> 
> Can I have a review of this C2 loop optimization fix?
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
> 
> C2 has a loop optimization phase called intrinsify_fill. It matches the pattern
> of single array store with an loop invariant in a counted loop, like below, and
> replaces it with call to some stub routine.
> 
>   for (int i = start; i < limit; i++) {
>     a[i] = value;
>   }
> 
> Unfortunately, this doesn't work in current jdk after loop strip mining.
> The above loop is eventually unrolled and auto-vectorized by subsequent
> optimization phases. Root cause is that in strip-mined loops, the inner
> CountedLoopNode may be used by the address polling node of the safepoint
> in the outer loop. But as the safepoint polling has nothing related to any real
> operations in the loop, it should not hinder the pattern match.
> So in this patch, the polladr's use is ignored in the match check.
> 
> We have some performance comparison of the code for array fill, between
> the auto-vectorized version and the stub routine version. The JMH case for
> the tests can be found at [1]. Results show that on x86, the stub code is even
> slower than the auto-vectorized code. To prevent any regression, vm option
> OptimizedFill is turned off for x86 in this patch.
> So this patch doesn't impact on the generated code on x86. On AArch64, the
> two versions show almost the same performance in general cases. But if the
> value to be filled is zero, the stub code's performance is much better. This
> makes sence as AArch64 uses cache maintenance instructions (DC ZVA) to
> zero large blocks in the hand-crafted assembly. Below are JMH scores on
> AArch64.
> 
> Before:
>   Benchmark                     Mode  Cnt      Score     Error  Units
>   TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
>   TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
>   TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
>   TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
>   TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
>   TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
> 
> After:
>   Benchmark                     Mode  Cnt      Score     Error  Units
>   TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
>   TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
>   TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
>   TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
>   TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
>   TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
> 
> Another advantage of using the stub routine is that the generated code size is
> reduced.
> 
> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core, langtools::tier1 are tested
> and no new failure is found.

Thanks,
Pengfei



More information about the hotspot-compiler-dev mailing list