RFR(S): 8247307: C2: Loop array fill stub routines are not called

Tue Jun 16 22:34:06 UTC 2020

Hi Pengfei,

I ran your benchmark on my machine (only 3 iterations).

At first I was confused by the numbers because I assumed that bigger is better, but it is the opposite (the scores are in ns/op).

Second, I thought it might be an interaction with strip mining. By default LoopStripMiningIter is set to 1000 [1], so the inner 
loop in your tests will execute only 65 iterations, which would explain why vectorization (inlined instructions) wins.

Then I reduced the outer iterations to 100 to get 655 inner iterations. It did not help.

Finally, I tried switching off strip mining with -XX:-UseCountedLoopSafepoints or by using Parallel GC, and again got the 
same results:

-XX:+UseParallelGC

TestArrayFill.fillByteArray   avgt    3  2164.076 ± 168.821  ns/op
TestArrayFill.fillIntArray    avgt    3  8600.738 ± 954.559  ns/op
TestArrayFill.fillShortArray  avgt    3  4488.062 ± 217.501  ns/op
TestArrayFill.zeroByteArray   avgt    3  2167.487 ± 373.366  ns/op
TestArrayFill.zeroIntArray    avgt    3  8595.717 ± 579.696  ns/op
TestArrayFill.zeroShortArray  avgt    3  4482.645 ±  44.031  ns/op

-XX:+UseParallelGC -XX:-OptimizeFill

TestArrayFill.fillByteArray   avgt    3  1586.719 ±  87.300  ns/op
TestArrayFill.fillIntArray    avgt    3  5879.356 ±  34.836  ns/op
TestArrayFill.fillShortArray  avgt    3  3045.436 ±  41.981  ns/op
TestArrayFill.zeroByteArray   avgt    3  1513.536 ± 738.573  ns/op
TestArrayFill.zeroIntArray    avgt    3  5911.524 ± 172.335  ns/op
TestArrayFill.zeroShortArray  avgt    3  3053.304 ±  50.365  ns/op
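
To make the gap easier to read, here is a rough sketch that only divides the scores of the first run by the scores of the -XX:-OptimizeFill run. The class and method names are made up for illustration; the numbers are copied from the two tables above, and I am assuming the first run is the one that goes through the fill stub.

```java
public class FillSlowdown {
    static final String[] NAMES = {
        "fillByteArray", "fillIntArray", "fillShortArray",
        "zeroByteArray", "zeroIntArray", "zeroShortArray"
    };
    // ns/op from the run with -XX:+UseParallelGC (lower is better).
    static final double[] FIRST_RUN = {
        2164.076, 8600.738, 4488.062, 2167.487, 8595.717, 4482.645
    };
    // ns/op from the run with -XX:+UseParallelGC -XX:-OptimizeFill.
    static final double[] OPTIMIZE_FILL_OFF = {
        1586.719, 5879.356, 3045.436, 1513.536, 5911.524, 3053.304
    };

    // Ratio above 1.0 means the first run is slower than the -OptimizeFill run.
    static double slowdown(double firstRunNs, double optimizeFillOffNs) {
        return firstRunNs / optimizeFillOffNs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-15s %.2fx%n", NAMES[i],
                              slowdown(FIRST_RUN[i], OPTIMIZE_FILL_OFF[i]));
        }
    }
}
```

With only 3 iterations per run the error bars are wide, so the ratios are indicative at best.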

Looking at the generated code, I see that the vectorized loop may be unrolled 16 times (16 vector instructions, each 256 bits wide), whereas the 
generate_fill() stub on x86 has 2 (256-bit wide) instructions per iteration, or 1 instruction for avx512 [2].
Also, the stub has a lot of pre- and post-loop instructions and checks.
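
As a back-of-the-envelope check of the comparison above, the sketch below computes how many bytes each loop iteration stores. It assumes 256-bit (32-byte) AVX2 vectors and 512-bit (64-byte) AVX-512 vectors; the class and method names are hypothetical and are not taken from HotSpot.

```java
public class BytesPerIteration {
    // Bytes stored by one loop iteration that issues `storesPerIteration`
    // vector stores of `vectorBits` bits each.
    static int bytesPerIteration(int vectorBits, int storesPerIteration) {
        return vectorBits / 8 * storesPerIteration;
    }

    public static void main(String[] args) {
        // Auto-vectorized loop, unrolled 16 times with 256-bit stores (assumed width).
        System.out.println("vectorized loop: " + bytesPerIteration(256, 16));
        // Stub main loop: 2 stores of 256 bits per iteration.
        System.out.println("stub (AVX2):     " + bytesPerIteration(256, 2));
        // Stub main loop on avx512: 1 store of 512 bits per iteration.
        System.out.println("stub (AVX-512):  " + bytesPerIteration(512, 1));
    }
}
```

Under these assumptions the unrolled loop covers 512 bytes per iteration against 64 bytes for either stub variant, so the stub pays its loop overhead roughly 8 times as often.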

I thought maybe we could improve the stub. But it seems the vectorized loop with predicates is more compact and efficient. And it 
is auto-generated!

Based on these results, I agree with you on switching off the fill optimization on x86.

There could be side effects because the loop code will be larger (vs. a stub call), but that is already the case today, before your 
changes, so I don't think we will see a regression for GCs which use strip mining.

Thanks,
Vladimir

[1] http://hg.openjdk.java.net/jdk/jdk/file/3585f92edcaa/src/hotspot/share/gc/g1/g1Arguments.cpp#l183
[2] http://hg.openjdk.java.net/jdk/jdk/file/3585f92edcaa/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l5023

On 6/15/20 11:24 PM, Pengfei Li wrote:
> Sorry, I forgot to paste the JMH link below in my last email.
> 
> [1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java
> 
> BTW, if I turn on OptimizeFill manually, I see the performance regression below on x86. So I turned it off on x86 in my patch to keep things unchanged.
> 
> Before (x86 with -XX:+OptimizeFill)
>    Benchmark                     Mode  Cnt     Score    Error  Units
>    TestArrayFill.fillByteArray   avgt   25  1793.206 ± 15.337  ns/op
>    TestArrayFill.fillIntArray    avgt   25  6679.491 ± 14.729  ns/op
>    TestArrayFill.fillShortArray  avgt   25  3412.708 ± 12.005  ns/op
>    TestArrayFill.zeroByteArray   avgt   25  1785.940 ± 15.174  ns/op
>    TestArrayFill.zeroIntArray    avgt   25  6666.709 ± 11.735  ns/op
>    TestArrayFill.zeroShortArray  avgt   25  3404.146 ± 23.045  ns/op
> 
> After (x86 with -XX:+OptimizeFill)
>    Benchmark                     Mode  Cnt     Score     Error  Units
>    TestArrayFill.fillByteArray   avgt   25  2281.374 ± 191.220  ns/op
>    TestArrayFill.fillIntArray    avgt   25  9009.679 ± 901.541  ns/op
>    TestArrayFill.fillShortArray  avgt   25  4828.686 ±  49.199  ns/op
>    TestArrayFill.zeroByteArray   avgt   25  2463.745 ±  47.640  ns/op
>    TestArrayFill.zeroIntArray    avgt   25  9062.682 ± 939.538  ns/op
>    TestArrayFill.zeroShortArray  avgt   25  4837.231 ±  50.026  ns/op
> 
>> Hi,
>>
>> Can I have a review of this C2 loop optimization fix?
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>>
>> C2 has a loop optimization phase called intrinsify_fill. It matches the pattern
>> of a single array store of a loop-invariant value in a counted loop, like below, and
>> replaces it with a call to a stub routine.
>>
>>    for (int i = start; i < limit; i++) {
>>      a[i] = value;
>>    }
>>
>> Unfortunately, this doesn't work in the current JDK after loop strip mining.
>> The above loop is eventually unrolled and auto-vectorized by subsequent
>> optimization phases. The root cause is that in strip-mined loops, the inner
>> CountedLoopNode may be used by the address polling node of the safepoint
>> in the outer loop. But as the safepoint poll is unrelated to any real
>> operations in the loop, it should not hinder the pattern match.
>> So in this patch, the polladr's use is ignored in the match check.
>>
>> We have some performance comparison of the code for array fill, between
>> the auto-vectorized version and the stub routine version. The JMH case for
>> the tests can be found at [1]. Results show that on x86, the stub code is even
>> slower than the auto-vectorized code. To prevent any regression, the VM option
>> OptimizeFill is turned off for x86 in this patch.
>> So this patch doesn't change the generated code on x86. On AArch64, the
>> two versions show almost the same performance in general cases. But if the
>> value to be filled is zero, the stub code's performance is much better. This
>> makes sense as AArch64 uses cache maintenance instructions (DC ZVA) to
>> zero large blocks in the hand-crafted assembly. Below are JMH scores on
>> AArch64.
>>
>> Before:
>>    Benchmark                     Mode  Cnt      Score     Error  Units
>>    TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
>>    TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
>>    TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
>>    TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
>>    TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
>>    TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
>>
>> After:
>>    Benchmark                     Mode  Cnt      Score     Error  Units
>>    TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
>>    TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
>>    TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
>>    TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
>>    TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
>>    TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
>>
>> Another advantage of using the stub routine is that the generated code size is
>> reduced.
>>
>> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1 were tested
>> and no new failures were found.
> 
> Thanks,
> Pengfei
>
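
For reference, the loop shape discussed in the quoted review request (a single store of a loop-invariant value in a counted loop) can be written as a small standalone program. This is only an illustrative sketch: the class and method names are hypothetical, it is not the TestArrayFill benchmark, and whether C2 turns the loop into a stub call or an auto-vectorized loop depends on the flags and platform discussed in this thread.

```java
import java.util.Arrays;

public class FillPatternSketch {
    // Same shape as the loop in the review request: one array store of a
    // loop-invariant value per iteration of a counted loop.
    static void fill(int[] a, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[8];
        fill(a, 2, 6, 7);
        // prints [0, 0, 7, 7, 7, 7, 0, 0]
        System.out.println(Arrays.toString(a));
    }
}
```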