RFR(S): 8247307: C2: Loop array fill stub routines are not called
Tobias Hartmann
tobias.hartmann at oracle.com
Tue Jun 23 07:28:15 UTC 2020
Hi Pengfei,
> http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
I don't understand how the change in loopTransform.cpp is supposed to work. Shouldn't we bail out if
use != polladr?
Also please add a comment to vm_version_x86.cpp explaining why this optimization is currently disabled.
Could you add the JMH benchmark to test/micro/org/openjdk/bench/?
Thanks,
Tobias
On 22.06.20 07:02, Pengfei Li wrote:
> PING. May I have comments from another reviewer? I need a second review.
>
> --
> Thanks,
> Pengfei
>
>> Sorry I forgot to paste below JMH link in my last email.
>>
>> [1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java
>>
>> BTW. If I turn on OptimizeFill manually there's below performance regression
>> on x86. So I turned it off on x86 in my patch to make things unchanged.
>>
>> Before (x86 with -XX:+OptimizeFill)
>> Benchmark Mode Cnt Score Error Units
>> TestArrayFill.fillByteArray avgt 25 1793.206 ± 15.337 ns/op
>> TestArrayFill.fillIntArray avgt 25 6679.491 ± 14.729 ns/op
>> TestArrayFill.fillShortArray avgt 25 3412.708 ± 12.005 ns/op
>> TestArrayFill.zeroByteArray avgt 25 1785.940 ± 15.174 ns/op
>> TestArrayFill.zeroIntArray avgt 25 6666.709 ± 11.735 ns/op
>> TestArrayFill.zeroShortArray avgt 25 3404.146 ± 23.045 ns/op
>>
>> After (x86 with -XX:+OptimizeFill)
>> Benchmark Mode Cnt Score Error Units
>> TestArrayFill.fillByteArray avgt 25 2281.374 ± 191.220 ns/op
>> TestArrayFill.fillIntArray avgt 25 9009.679 ± 901.541 ns/op
>> TestArrayFill.fillShortArray avgt 25 4828.686 ± 49.199 ns/op
>> TestArrayFill.zeroByteArray avgt 25 2463.745 ± 47.640 ns/op
>> TestArrayFill.zeroIntArray avgt 25 9062.682 ± 939.538 ns/op
>> TestArrayFill.zeroShortArray avgt 25 4837.231 ± 50.026 ns/op
>>
>>> Hi,
>>>
>>> Can I have a review of this C2 loop optimization fix?
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
>>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>>>
>>> C2 has a loop optimization phase called intrinsify_fill. It matches
>>> the pattern of single array store with an loop invariant in a counted
>>> loop, like below, and replaces it with call to some stub routine.
>>>
>>> for (int i = start; i < limit; i++) {
>>> a[i] = value;
>>> }
>>>
>>> Unfortunately, this doesn't work in current jdk after loop strip mining.
>>> The above loop is eventually unrolled and auto-vectorized by
>>> subsequent optimization phases. Root cause is that in strip-mined
>>> loops, the inner CountedLoopNode may be used by the address polling
>>> node of the safepoint in the outer loop. But as the safepoint polling
>>> has nothing related to any real operations in the loop, it should not hinder
>> the pattern match.
>>> So in this patch, the polladr's use is ignored in the match check.
>>>
>>> We have some performance comparison of the code for array fill,
>>> between the auto-vectorized version and the stub routine version. The
>>> JMH case for the tests can be found at [1]. Results show that on x86,
>>> the stub code is even slower than the auto-vectorized code. To prevent
>>> any regression, vm option OptimizedFill is turned off for x86 in this patch.
>>> So this patch doesn't impact on the generated code on x86. On AArch64,
>>> the two versions show almost the same performance in general cases.
>>> But if the value to be filled is zero, the stub code's performance is
>>> much better. This makes sence as AArch64 uses cache maintenance
>>> instructions (DC ZVA) to zero large blocks in the hand-crafted
>>> assembly. Below are JMH scores on AArch64.
>>>
>>> Before:
>>> Benchmark Mode Cnt Score Error Units
>>> TestArrayFill.fillByteArray avgt 25 2078.700 ± 7.719 ns/op
>>> TestArrayFill.fillIntArray avgt 25 12371.497 ± 566.773 ns/op
>>> TestArrayFill.fillShortArray avgt 25 4132.439 ± 25.096 ns/op
>>> TestArrayFill.zeroByteArray avgt 25 2080.313 ± 7.516 ns/op
>>> TestArrayFill.zeroIntArray avgt 25 10961.331 ± 527.750 ns/op
>>> TestArrayFill.zeroShortArray avgt 25 4126.386 ± 20.997 ns/op
>>>
>>> After:
>>> Benchmark Mode Cnt Score Error Units
>>> TestArrayFill.fillByteArray avgt 25 2080.382 ± 2.103 ns/op
>>> TestArrayFill.fillIntArray avgt 25 11997.621 ± 569.058 ns/op
>>> TestArrayFill.fillShortArray avgt 25 4309.035 ± 285.456 ns/op
>>> TestArrayFill.zeroByteArray avgt 25 903.434 ± 10.944 ns/op
>>> TestArrayFill.zeroIntArray avgt 25 8141.533 ± 946.341 ns/op
>>> TestArrayFill.zeroShortArray avgt 25 1784.124 ± 24.618 ns/op
>>>
>>> Another advantage of using the stub routine is that the generated code
>>> size is reduced.
>>>
>>> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core, langtools::tier1
>>> are tested and no new failure is found.
>>
>> Thanks,
>> Pengfei
>
More information about the hotspot-compiler-dev
mailing list