RFR(S): 8247307: C2: Loop array fill stub routines are not called

Mon Jun 22 05:02:40 UTC 2020

PING. May I have comments from another reviewer? I need a second review.

--
Thanks,
Pengfei

> Sorry I forgot to paste below JMH link in my last email.
> 
> [1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java
> 
> BTW. If I turn on OptimizeFill manually there's below performance regression
> on x86. So I turned it off on x86 in my patch to make things unchanged.
> 
> Before (x86 with -XX:+OptimizeFill)
>   Benchmark                     Mode  Cnt     Score    Error  Units
>   TestArrayFill.fillByteArray   avgt   25  1793.206 ± 15.337  ns/op
>   TestArrayFill.fillIntArray    avgt   25  6679.491 ± 14.729  ns/op
>   TestArrayFill.fillShortArray  avgt   25  3412.708 ± 12.005  ns/op
>   TestArrayFill.zeroByteArray   avgt   25  1785.940 ± 15.174  ns/op
>   TestArrayFill.zeroIntArray    avgt   25  6666.709 ± 11.735  ns/op
>   TestArrayFill.zeroShortArray  avgt   25  3404.146 ± 23.045  ns/op
> 
> After (x86 with -XX:+OptimizeFill)
>   Benchmark                     Mode  Cnt     Score     Error  Units
>   TestArrayFill.fillByteArray   avgt   25  2281.374 ± 191.220  ns/op
>   TestArrayFill.fillIntArray    avgt   25  9009.679 ± 901.541  ns/op
>   TestArrayFill.fillShortArray  avgt   25  4828.686 ±  49.199  ns/op
>   TestArrayFill.zeroByteArray   avgt   25  2463.745 ±  47.640  ns/op
>   TestArrayFill.zeroIntArray    avgt   25  9062.682 ± 939.538  ns/op
>   TestArrayFill.zeroShortArray  avgt   25  4837.231 ±  50.026  ns/op
> 
> > Hi,
> >
> > Can I have a review of this C2 loop optimization fix?
> >
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
> > Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
> >
> > C2 has a loop optimization phase called intrinsify_fill. It matches
> > the pattern of single array store with an loop invariant in a counted
> > loop, like below, and replaces it with call to some stub routine.
> >
> >   for (int i = start; i < limit; i++) {
> >     a[i] = value;
> >   }
> >
> > Unfortunately, this doesn't work in current jdk after loop strip mining.
> > The above loop is eventually unrolled and auto-vectorized by
> > subsequent optimization phases. Root cause is that in strip-mined
> > loops, the inner CountedLoopNode may be used by the address polling
> > node of the safepoint in the outer loop. But as the safepoint polling
> > has nothing related to any real operations in the loop, it should not hinder
> the pattern match.
> > So in this patch, the polladr's use is ignored in the match check.
> >
> > We have some performance comparison of the code for array fill,
> > between the auto-vectorized version and the stub routine version. The
> > JMH case for the tests can be found at [1]. Results show that on x86,
> > the stub code is even slower than the auto-vectorized code. To prevent
> > any regression, vm option OptimizedFill is turned off for x86 in this patch.
> > So this patch doesn't impact on the generated code on x86. On AArch64,
> > the two versions show almost the same performance in general cases.
> > But if the value to be filled is zero, the stub code's performance is
> > much better. This makes sence as AArch64 uses cache maintenance
> > instructions (DC ZVA) to zero large blocks in the hand-crafted
> > assembly. Below are JMH scores on AArch64.
> >
> > Before:
> >   Benchmark                     Mode  Cnt      Score     Error  Units
> >   TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
> >   TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
> >   TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
> >   TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
> >   TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
> >   TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
> >
> > After:
> >   Benchmark                     Mode  Cnt      Score     Error  Units
> >   TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
> >   TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
> >   TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
> >   TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
> >   TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
> >   TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
> >
> > Another advantage of using the stub routine is that the generated code
> > size is reduced.
> >
> > Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core, langtools::tier1
> > are tested and no new failure is found.
> 
> Thanks,
> Pengfei