RFR(S): 8247307: C2: Loop array fill stub routines are not called

Tue Jun 23 07:28:15 UTC 2020

Hi Pengfei,

> http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/

I don't understand how the change in loopTransform.cpp is supposed to work. Shouldn't we bail out if
use != polladr?

Also please add a comment to vm_version_x86.cpp explaining why this optimization is currently disabled.

Could you add the JMH benchmark to test/micro/org/openjdk/bench/?

Thanks,
Tobias

On 22.06.20 07:02, Pengfei Li wrote:
> PING. May I have comments from another reviewer? I need a second review.
> 
> --
> Thanks,
> Pengfei
> 
>> Sorry I forgot to paste below JMH link in my last email.
>>
>> [1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java
>>
>> BTW. If I turn on OptimizeFill manually there's below performance regression
>> on x86. So I turned it off on x86 in my patch to make things unchanged.
>>
>> Before (x86 with -XX:+OptimizeFill)
>>   Benchmark                     Mode  Cnt     Score    Error  Units
>>   TestArrayFill.fillByteArray   avgt   25  1793.206 ± 15.337  ns/op
>>   TestArrayFill.fillIntArray    avgt   25  6679.491 ± 14.729  ns/op
>>   TestArrayFill.fillShortArray  avgt   25  3412.708 ± 12.005  ns/op
>>   TestArrayFill.zeroByteArray   avgt   25  1785.940 ± 15.174  ns/op
>>   TestArrayFill.zeroIntArray    avgt   25  6666.709 ± 11.735  ns/op
>>   TestArrayFill.zeroShortArray  avgt   25  3404.146 ± 23.045  ns/op
>>
>> After (x86 with -XX:+OptimizeFill)
>>   Benchmark                     Mode  Cnt     Score     Error  Units
>>   TestArrayFill.fillByteArray   avgt   25  2281.374 ± 191.220  ns/op
>>   TestArrayFill.fillIntArray    avgt   25  9009.679 ± 901.541  ns/op
>>   TestArrayFill.fillShortArray  avgt   25  4828.686 ±  49.199  ns/op
>>   TestArrayFill.zeroByteArray   avgt   25  2463.745 ±  47.640  ns/op
>>   TestArrayFill.zeroIntArray    avgt   25  9062.682 ± 939.538  ns/op
>>   TestArrayFill.zeroShortArray  avgt   25  4837.231 ±  50.026  ns/op
>>
>>> Hi,
>>>
>>> Can I have a review of this C2 loop optimization fix?
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
>>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>>>
>>> C2 has a loop optimization phase called intrinsify_fill. It matches
>>> the pattern of single array store with an loop invariant in a counted
>>> loop, like below, and replaces it with call to some stub routine.
>>>
>>>   for (int i = start; i < limit; i++) {
>>>     a[i] = value;
>>>   }
>>>
>>> Unfortunately, this doesn't work in current jdk after loop strip mining.
>>> The above loop is eventually unrolled and auto-vectorized by
>>> subsequent optimization phases. Root cause is that in strip-mined
>>> loops, the inner CountedLoopNode may be used by the address polling
>>> node of the safepoint in the outer loop. But as the safepoint polling
>>> has nothing related to any real operations in the loop, it should not hinder
>> the pattern match.
>>> So in this patch, the polladr's use is ignored in the match check.
>>>
>>> We have some performance comparison of the code for array fill,
>>> between the auto-vectorized version and the stub routine version. The
>>> JMH case for the tests can be found at [1]. Results show that on x86,
>>> the stub code is even slower than the auto-vectorized code. To prevent
>>> any regression, vm option OptimizedFill is turned off for x86 in this patch.
>>> So this patch doesn't impact on the generated code on x86. On AArch64,
>>> the two versions show almost the same performance in general cases.
>>> But if the value to be filled is zero, the stub code's performance is
>>> much better. This makes sence as AArch64 uses cache maintenance
>>> instructions (DC ZVA) to zero large blocks in the hand-crafted
>>> assembly. Below are JMH scores on AArch64.
>>>
>>> Before:
>>>   Benchmark                     Mode  Cnt      Score     Error  Units
>>>   TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
>>>   TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
>>>   TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
>>>   TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
>>>   TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
>>>   TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
>>>
>>> After:
>>>   Benchmark                     Mode  Cnt      Score     Error  Units
>>>   TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
>>>   TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
>>>   TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
>>>   TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
>>>   TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
>>>   TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
>>>
>>> Another advantage of using the stub routine is that the generated code
>>> size is reduced.
>>>
>>> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core, langtools::tier1
>>> are tested and no new failure is found.
>>
>> Thanks,
>> Pengfei
>