RFR(S): 8247307: C2: Loop array fill stub routines are not called
Nils Eliasson
nils.eliasson at oracle.com
Tue Jun 16 12:18:56 UTC 2020
Hi,
Also we need to consider code size. The auto-vectorized version is
inlined, and the unrolling may fail or be limited. To take full
advantage of this we would need to outline the fill loop (as is
done for the intrinsic, where the loop is replaced with a call). But
instead of a handcrafted intrinsic, the call would go to some Java
code. To do this we need somewhere to put the Java version of the fill loop.
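
As a rough illustration of the idea, here is a minimal sketch of what an
outlined Java fill loop could look like. The class name FillLoopSketch and
the method fillInt are hypothetical; they are not part of the JDK or of the
patch under review, and plain Java cannot guarantee that C2 keeps the
method out of line rather than inlining it again.

```java
import java.util.Arrays;

// Hypothetical holder class: the name and placement are assumptions for
// illustration only, not something proposed in the thread or the patch.
final class FillLoopSketch {
    private FillLoopSketch() {}

    // A counted loop with a single array store of a loop-invariant value.
    // Keeping it in its own small method means a call site holds a call
    // instead of its own copy of the unrolled, vectorized loop body.
    static void fillInt(int[] a, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[8];
        fillInt(a, 2, 6, 7);
        System.out.println(Arrays.toString(a));
    }
}
```

Only the elements in [start, limit) are written, so the call above leaves
the first two and last two elements of the array at zero.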
Regards,
Nils
Leveraging the auto-vectorization is very nice - but
On 2020-06-15 20:41, Vladimir Kozlov wrote:
> Hi,
>
> It would be interesting to see the difference with strip-mining off.
> With strip-mining, only part of the iterations are replaced with the stub.
>
> I don't see the referenced link [1] in the e-mail.
>
> Is this performance data for AArch64?
>
> Which x86 CPU did you test on? (AVX-512?)
>
> Please add both x64 and AArch64 perf data to the RFE.
>
> What array sizes did you test?
>
> A few years ago OptimizedFill won over vectorized loops, but CPUs and
> vectorization have improved since then. Maybe we can deprecate this
> code if it does not have performance benefits. Or we should revisit
> the stub's code for modern CPUs.
>
> We need more data.
>
> Thanks,
> Vladimir
>
> On 6/14/20 8:20 PM, Pengfei Li wrote:
>> Hi,
>>
>> Can I have a review of this C2 loop optimization fix?
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>>
>> C2 has a loop optimization phase called intrinsify_fill. It matches the
>> pattern of a single array store of a loop-invariant value in a counted
>> loop, like the one below, and replaces it with a call to a stub routine.
>>
>>   for (int i = start; i < limit; i++) {
>>     a[i] = value;
>>   }
>>
>> Unfortunately, this no longer works in the current JDK now that loops are
>> strip-mined. Instead, the above loop is eventually unrolled and
>> auto-vectorized by subsequent optimization phases. The root cause is that
>> in strip-mined loops, the inner CountedLoopNode may be used by the address
>> polling node of the safepoint in the outer loop. But as the safepoint poll
>> is unrelated to any real operation in the loop, it should not hinder the
>> pattern match. So in this patch, the polladr's use is ignored in the match
>> check.
>>
>> We have compared the performance of the array fill code between the
>> auto-vectorized version and the stub routine version. The JMH case
>> for the tests can be found at [1]. Results show that on x86, the stub
>> code is even slower than the auto-vectorized code. To prevent any
>> regression, the VM option OptimizedFill is turned off for x86 in this
>> patch, so the patch doesn't affect the generated code on x86. On AArch64,
>> the two versions show almost the same performance in general cases. But
>> if the value to be filled is zero, the stub code performs much
>> better. This makes sense, as AArch64 uses cache maintenance instructions
>> (DC ZVA) to zero large blocks in the hand-crafted assembly. Below are
>> JMH scores on AArch64.
>>
>> Before:
>> Benchmark                     Mode  Cnt      Score     Error  Units
>> TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
>> TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
>> TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
>> TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
>> TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
>> TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
>>
>> After:
>> Benchmark                     Mode  Cnt      Score     Error  Units
>> TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
>> TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
>> TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
>> TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
>> TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
>> TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
>>
>> Another advantage of using the stub routine is that the generated code
>> size is reduced.
>>
>> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1
>> were run and no new failures were found.
>>
>> --
>> Thanks,
>> Pengfei
>>
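
To put the AArch64 JMH scores quoted above into proportion, the following
small sketch computes the before/after ratio for the three zero-fill
benchmarks. The class name FillSpeedup and the helper speedup are
hypothetical names chosen for this illustration; the scores are copied
from the quoted tables, and nothing here is re-measured.

```java
// Hypothetical helper for reading the quoted JMH tables; not part of the patch.
final class FillSpeedup {
    private FillSpeedup() {}

    // Ratio of average times (ns/op). Values above 1.0 mean the "After"
    // (stub routine) version took less time per operation.
    static double speedup(double beforeNsPerOp, double afterNsPerOp) {
        return beforeNsPerOp / afterNsPerOp;
    }

    public static void main(String[] args) {
        String[] names  = {"zeroByteArray", "zeroIntArray", "zeroShortArray"};
        double[] before = {2080.313, 10961.331, 4126.386};  // "Before" scores
        double[] after  = {903.434, 8141.533, 1784.124};    // "After" scores
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-16s %.2fx%n", names[i],
                              speedup(before[i], after[i]));
        }
    }
}
```

The int-array scores carry large error bars in both tables (± 527.750 and
± 946.341 ns/op), so the ratio for zeroIntArray is the least certain of
the three.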