RFR(S): 8247307: C2: Loop array fill stub routines are not called

Mon Jun 15 03:20:53 UTC 2020

Hi,

Can I have a review of this C2 loop optimization fix?

JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/

C2 has a loop optimization phase called intrinsify_fill. It matches the
pattern of a single array store of a loop-invariant value in a counted
loop, like the one below, and replaces it with a call to a stub routine.

  for (int i = start; i < limit; i++) {
    a[i] = value;
  }
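
For contrast, here is a small sketch of loop shapes that do and do not fit
the pattern as described above (a single store of a loop-invariant value
in a counted loop). The class and method names are hypothetical and only
for illustration. Whether C2 actually intrinsifies a given loop also
depends on the platform and VM flags.

```java
import java.util.Arrays;

// Illustrative only: class and method names are hypothetical, not from the patch.
class FillShapes {
    // Single store of a loop-invariant value in a counted loop:
    // the shape described above.
    static void fillInvariant(int[] a, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    // The stored value depends on the induction variable, so it is not
    // loop invariant and falls outside the described pattern.
    static void fillWithIndex(int[] a, int start, int limit) {
        for (int i = start; i < limit; i++) {
            a[i] = i;
        }
    }

    // Two array stores per iteration rather than a single one, so this
    // also falls outside the described pattern.
    static void fillTwoArrays(int[] a, int[] b, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
            b[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[8];
        fillInvariant(a, 2, 6, 7);
        System.out.println(Arrays.toString(a)); // prints [0, 0, 7, 7, 7, 7, 0, 0]
    }
}
```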

Unfortunately, this no longer works in the current JDK since loop strip
mining was introduced. Instead, the above loop is eventually unrolled
and auto-vectorized by subsequent optimization phases. The root cause is
that in strip-mined loops, the inner CountedLoopNode may be used by the
address polling node of the safepoint in the outer loop. But as the
safepoint poll is unrelated to any real operation in the loop, it should
not hinder the pattern match. So in this patch, the use by the polladr
is ignored in the match check.

We compared the performance of the array fill code between the
auto-vectorized version and the stub routine version. The JMH case
for the tests can be found at [1]. Results show that on x86, the stub
code is slower than the auto-vectorized code. To prevent any
regression, the VM option OptimizeFill is turned off for x86 in this
patch, so the patch has no impact on the generated code on x86. On
AArch64, the two versions show almost the same performance in general
cases. But if the value to be filled is zero, the stub code performs
much better (roughly 2.3x faster for byte and short arrays in the scores
below). This makes sense, as the hand-crafted AArch64 assembly uses a
cache maintenance instruction (DC ZVA) to zero large blocks. Below are
JMH scores on AArch64.

Before:
  Benchmark                     Mode  Cnt      Score     Error  Units
  TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
  TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
  TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
  TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
  TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
  TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op

After:
  Benchmark                     Mode  Cnt      Score     Error  Units
  TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
  TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
  TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
  TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
  TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
  TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
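
The actual benchmark source is the one referenced at [1]. As a rough
sketch of the two kinds of kernel being compared (fill with an arbitrary
value versus fill with zero), something like the following could be
used. The class name, method names, array sizes and fill value are
assumptions for illustration, not taken from the benchmark.

```java
import java.util.Arrays;

// Hypothetical kernels; the real benchmark is the one referenced at [1].
class FillKernels {
    // Fill with an arbitrary loop-invariant value.
    static void fillBytes(byte[] a, byte value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    // Fill with zero: the case where the AArch64 stub is reported to win.
    static void zeroBytes(byte[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = 0;
        }
    }

    // Same zeroing loop for a wider element type.
    static void zeroInts(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = 0;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];   // size chosen arbitrarily
        fillBytes(bytes, (byte) 1);
        System.out.println(Arrays.toString(bytes));
        zeroBytes(bytes);
        System.out.println(Arrays.toString(bytes));
    }
}
```

Running such kernels with -XX:+OptimizeFill and -XX:-OptimizeFill is one
way to compare the stub path against the auto-vectorized path, assuming
a C2 build where the flag is available.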

Another advantage of using the stub routine is that the generated code
size is reduced.

Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1
were tested and no new failures were found.

--
Thanks,
Pengfei