RFR(S): 8247307: C2: Loop array fill stub routines are not called

Tue Jun 16 12:18:56 UTC 2020

Hi,

Leveraging the auto-vectorization is very nice, but we also need to
consider code size. The auto-vectorized version is inlined, and the
unrolling may fail or be limited. To take full advantage of this, we
would need to outline the fill loop (as is done for the intrinsic, where
the loop is replaced by a call). But instead of calling a handcrafted
intrinsic, the call would go to some Java code. To do this, we need
somewhere to put the Java version of the fill loop.

Regards,
Nils
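
A minimal sketch of what such an out-of-line Java fill loop might look
like, using the loop shape quoted further down. The class name FillLoops
and its location are hypothetical; nothing like it is part of the patch
or of the JDK, and how C2 would route the call is left open here.

```java
import java.util.Arrays;

// Hypothetical holder class for out-of-line fill loops (an assumption,
// not an existing JDK class).
public final class FillLoops {
    private FillLoops() {}

    // One array store of a loop-invariant value in a counted loop.
    static void fill(int[] a, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    // Same loop shape for byte arrays.
    static void fill(byte[] a, int start, int limit, byte value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[16];
        fill(a, 4, 12, 42);

        // Cross-check against the library fill over the same range.
        int[] expected = new int[16];
        Arrays.fill(expected, 4, 12, 42);
        System.out.println(Arrays.equals(a, expected));
    }
}
```

With the loop kept in one shared method, each call site would only pay
for a call rather than for an inlined, unrolled, vectorized loop body.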

On 2020-06-15 20:41, Vladimir Kozlov wrote:
> Hi,
>
> It would be interesting to see the difference with strip mining off.
> With strip mining, only part of the iterations are replaced with the
> stub.
>
> I don't see the referenced link [1] in the e-mail.
>
> Are these performance data for AArch64?
>
> What x86 CPU did you test on? (AVX-512?)
>
> Please add both x64 and AArch64 perf data to the RFE.
>
> What size of arrays did you test?
>
> A few years ago OptimizeFill won over vectorized loops, but CPUs and
> vectorization have improved since then. Maybe we can deprecate this
> code if it does not have performance benefits. Or we should revisit
> the stub's code for modern CPUs.
>
> We need more data.
>
> Thanks,
> Vladimir
>
> On 6/14/20 8:20 PM, Pengfei Li wrote:
>> Hi,
>>
>> Can I have a review of this C2 loop optimization fix?
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>>
>> C2 has a loop optimization phase called intrinsify_fill. It matches the
>> pattern of a single array store of a loop-invariant value in a counted
>> loop, like the one below, and replaces it with a call to a stub routine.
>>
>>    for (int i = start; i < limit; i++) {
>>      a[i] = value;
>>    }
>>
>> Unfortunately, this doesn't work in the current JDK after loop strip
>> mining. The above loop is eventually unrolled and auto-vectorized by
>> subsequent optimization phases. The root cause is that in strip-mined
>> loops, the inner CountedLoopNode may be used by the polling address
>> node of the safepoint in the outer loop. But as the safepoint poll is
>> unrelated to any real operation in the loop, it should not hinder the
>> pattern match. So in this patch, the polladr's use is ignored in the
>> match check.
>>
>> We have a performance comparison of the array fill code between the
>> auto-vectorized version and the stub routine version. The JMH case
>> for the tests can be found at [1]. Results show that on x86, the stub
>> code is even slower than the auto-vectorized code. To prevent any
>> regression, the VM option OptimizeFill is turned off for x86 in this
>> patch, so this patch doesn't change the generated code on x86. On
>> AArch64, the two versions show almost the same performance in general
>> cases. But if the value to be filled is zero, the stub code performs
>> much better. This makes sense, as AArch64 uses cache maintenance
>> instructions (DC ZVA) to zero large blocks in the hand-crafted
>> assembly. Below are JMH scores on AArch64.
>>
>> Before:
>>    Benchmark                     Mode  Cnt      Score     Error Units
>>    TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719 ns/op
>>    TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773 ns/op
>>    TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096 ns/op
>>    TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516 ns/op
>>    TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750 ns/op
>>    TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997 ns/op
>>
>> After:
>>    Benchmark                     Mode  Cnt      Score     Error Units
>>    TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103 ns/op
>>    TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058 ns/op
>>    TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456 ns/op
>>    TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944 ns/op
>>    TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341 ns/op
>>    TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618 ns/op
>>
>> Another advantage of using the stub routine is that the generated code
>> size is reduced.
>>
>> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1
>> were run and no new failures were found.
>>
>> -- 
>> Thanks,
>> Pengfei
>>
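
For readers without the JMH source behind [1], here is a rough sketch of
the two loop shapes the benchmark names above suggest: a fill with a
non-zero value and a fill with zero. The method bodies, the array length
and the timing loop are all assumptions, not the actual TestArrayFill
code, and a nanoTime loop like this is no substitute for JMH.

```java
// Hypothetical stand-in for the benchmark; only the method names are
// borrowed from the JMH tables above.
public final class ArrayFillShapes {
    // Assumed array length; the thread does not state the sizes used.
    static final int SIZE = 65536;

    // General case: store a loop-invariant, possibly non-zero value.
    static void fillIntArray(int[] a, int value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    // Zero case: the one where the AArch64 stub is reported to win.
    static void zeroIntArray(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = 0;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[SIZE];
        int rounds = 20_000; // enough calls for C2 to compile both loops

        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            fillIntArray(a, r + 1);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            zeroIntArray(a);
        }
        long t2 = System.nanoTime();

        System.out.println("fill ns/op: " + (t1 - t0) / rounds);
        System.out.println("zero ns/op: " + (t2 - t1) / rounds);
    }
}
```

Running it once with -XX:+OptimizeFill and once with -XX:-OptimizeFill
gives a crude way to compare the stub and auto-vectorized paths on a
given machine.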