RFR(S): 8247307: C2: Loop array fill stub routines are not called
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Jun 16 22:34:06 UTC 2020
Hi Pengfei,
I ran your benchmark on my machine (only 3 iterations).
First, I was confused by the numbers because I assumed that bigger is better, but it is the opposite: the scores are in ns/op.
Second, I thought it might be an interaction with strip mining. By default LoopStripMiningIter is set to 1000 [1], so the inner
loop in your tests will execute only 65 iterations, which would explain why vectorization (inlined instructions) wins.
Then I reduced the outer iterations to 100 to get 655 inner iterations. It did not help.
Finally, I tried switching off strip mining with -XX:-UseCountedLoopSafepoints or by using Parallel GC, and again got the
same results:
-XX:+UseParallelGC
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.fillByteArray   avgt    3   2164.076 ± 168.821  ns/op
TestArrayFill.fillIntArray    avgt    3   8600.738 ± 954.559  ns/op
TestArrayFill.fillShortArray  avgt    3   4488.062 ± 217.501  ns/op
TestArrayFill.zeroByteArray   avgt    3   2167.487 ± 373.366  ns/op
TestArrayFill.zeroIntArray    avgt    3   8595.717 ± 579.696  ns/op
TestArrayFill.zeroShortArray  avgt    3   4482.645 ±  44.031  ns/op
-XX:+UseParallelGC -XX:-OptimizeFill
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.fillByteArray   avgt    3   1586.719 ±  87.300  ns/op
TestArrayFill.fillIntArray    avgt    3   5879.356 ±  34.836  ns/op
TestArrayFill.fillShortArray  avgt    3   3045.436 ±  41.981  ns/op
TestArrayFill.zeroByteArray   avgt    3   1513.536 ± 738.573  ns/op
TestArrayFill.zeroIntArray    avgt    3   5911.524 ± 172.335  ns/op
TestArrayFill.zeroShortArray  avgt    3   3053.304 ±  50.365  ns/op
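To make the smaller-is-better reading concrete, the ratio between the two runs above can be computed directly. The sketch below only does arithmetic on the ns/op scores listed above; the class name FillSlowdown and its slowdown() helper are made up for illustration and are not part of the benchmark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FillSlowdown {
    // ns/op scores copied from the two -XX:+UseParallelGC runs above:
    // { run without -XX:-OptimizeFill, run with -XX:-OptimizeFill }
    static final Map<String, double[]> SCORES = new LinkedHashMap<>();
    static {
        SCORES.put("fillByteArray",  new double[] {2164.076, 1586.719});
        SCORES.put("fillIntArray",   new double[] {8600.738, 5879.356});
        SCORES.put("fillShortArray", new double[] {4488.062, 3045.436});
        SCORES.put("zeroByteArray",  new double[] {2167.487, 1513.536});
        SCORES.put("zeroIntArray",   new double[] {8595.717, 5911.524});
        SCORES.put("zeroShortArray", new double[] {4482.645, 3053.304});
    }

    // A ratio above 1 means the first run needs more time per operation
    // than the run with -XX:-OptimizeFill.
    static double slowdown(String benchmark) {
        double[] s = SCORES.get(benchmark);
        return s[0] / s[1];
    }

    public static void main(String[] args) {
        for (String name : SCORES.keySet()) {
            System.out.printf("%-16s ratio = %.2f%n", name, slowdown(name));
        }
    }
}
```

With only 3 iterations per benchmark and the error bars shown above, these ratios are rough indications rather than precise measurements.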
Looking at the generated code, I see that the vectorized loop may be unrolled 16 times (16 vector instructions of 256 bits each), whereas
the generate_fill() stub on x86 has 2 (256-bit wide) instructions per iteration, and 1 instruction for avx512 [2].
Also, the stub has a lot of pre- and post-loop instructions and checks.
I thought maybe we could improve the stub. But it seems the vectorized loop with predicates is more compact and efficient. And it
is auto-generated!
Based on these results, I agree with you on switching off the fill optimization on x86.
There could be side effects because the loop code will be larger (versus a stub call), but that is already the case today, before your
changes, so I don't think we will see a regression for GCs that use strip mining.
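For anyone who wants to look at the generated code without the JMH harness, here is a minimal sketch of the loop shapes under discussion. It is not the TestArrayFill.java from the review: the class name FillLoops, the array length, and the number of rounds are assumptions made for illustration.

```java
import java.util.Arrays;

public class FillLoops {
    // Each method is a single array store of a loop-invariant value in a
    // counted loop, i.e. the shape quoted in the review request below.
    static void fillByte(byte[] a, int start, int limit, byte value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    static void fillShort(short[] a, int start, int limit, short value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    static void fillInt(int[] a, int start, int limit, int value) {
        for (int i = start; i < limit; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        final int length = 65536;   // assumed size, not taken from the benchmark
        final int rounds = 10_000;  // assumed; intended to get the loops hot enough for C2
        byte[] b = new byte[length];
        short[] s = new short[length];
        int[] n = new int[length];
        for (int r = 0; r < rounds; r++) {
            fillByte(b, 0, length, (byte) 1);
            fillShort(s, 0, length, (short) 1);
            fillInt(n, 0, length, 1);
        }
        // Sanity check of one result against java.util.Arrays.fill.
        byte[] expected = new byte[length];
        Arrays.fill(expected, (byte) 1);
        System.out.println(Arrays.equals(b, expected) ? "filled" : "mismatch");
    }
}
```

Running this once with -XX:+OptimizeFill and once with -XX:-OptimizeFill (the flag used in this thread) and comparing the compiled code should show whether a stub call or a vectorized loop is emitted, though this is not a substitute for the JMH measurements.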
Thanks,
Vladimir
[1] http://hg.openjdk.java.net/jdk/jdk/file/3585f92edcaa/src/hotspot/share/gc/g1/g1Arguments.cpp#l183
[2] http://hg.openjdk.java.net/jdk/jdk/file/3585f92edcaa/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l5023
On 6/15/20 11:24 PM, Pengfei Li wrote:
> Sorry, I forgot to paste the JMH link below in my last email.
>
> [1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java
>
> BTW, if I turn on OptimizeFill manually, I see the performance regression below on x86. So I turned it off on x86 in my patch to keep things unchanged.
>
> Before (x86 with -XX:+OptimizeFill)
> Benchmark                     Mode  Cnt      Score     Error  Units
> TestArrayFill.fillByteArray   avgt   25   1793.206 ±  15.337  ns/op
> TestArrayFill.fillIntArray    avgt   25   6679.491 ±  14.729  ns/op
> TestArrayFill.fillShortArray  avgt   25   3412.708 ±  12.005  ns/op
> TestArrayFill.zeroByteArray   avgt   25   1785.940 ±  15.174  ns/op
> TestArrayFill.zeroIntArray    avgt   25   6666.709 ±  11.735  ns/op
> TestArrayFill.zeroShortArray  avgt   25   3404.146 ±  23.045  ns/op
>
> After (x86 with -XX:+OptimizeFill)
> Benchmark                     Mode  Cnt      Score     Error  Units
> TestArrayFill.fillByteArray   avgt   25   2281.374 ± 191.220  ns/op
> TestArrayFill.fillIntArray    avgt   25   9009.679 ± 901.541  ns/op
> TestArrayFill.fillShortArray  avgt   25   4828.686 ±  49.199  ns/op
> TestArrayFill.zeroByteArray   avgt   25   2463.745 ±  47.640  ns/op
> TestArrayFill.zeroIntArray    avgt   25   9062.682 ± 939.538  ns/op
> TestArrayFill.zeroShortArray  avgt   25   4837.231 ±  50.026  ns/op
>
>> Hi,
>>
>> Can I have a review of this C2 loop optimization fix?
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8247307
>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8247307/webrev.00/
>>
>> C2 has a loop optimization phase called intrinsify_fill. It matches the pattern
>> of a single array store of a loop-invariant value in a counted loop, like below,
>> and replaces it with a call to a stub routine.
>>
>>   for (int i = start; i < limit; i++) {
>>     a[i] = value;
>>   }
>>
>> Unfortunately, this doesn't work in the current JDK after loop strip mining.
>> The above loop is eventually unrolled and auto-vectorized by subsequent
>> optimization phases. The root cause is that in strip-mined loops, the inner
>> CountedLoopNode may be used by the address polling node of the safepoint
>> in the outer loop. But since the safepoint poll is unrelated to any real
>> operation in the loop, it should not hinder the pattern match.
>> So in this patch, the polladr's use is ignored in the match check.
>>
>> We have some performance comparisons of the array fill code, between
>> the auto-vectorized version and the stub routine version. The JMH test
>> case can be found at [1]. Results show that on x86, the stub code is even
>> slower than the auto-vectorized code. To prevent any regression, the VM
>> option OptimizeFill is turned off for x86 in this patch, so the patch does
>> not change the generated code on x86. On AArch64, the two versions perform
>> almost the same in general cases, but if the value to be filled is zero, the
>> stub code is much faster. This makes sense, as the hand-crafted AArch64
>> assembly uses a cache maintenance instruction (DC ZVA) to zero large
>> blocks. Below are JMH scores on AArch64.
>>
>> Before:
>> Benchmark                     Mode  Cnt      Score     Error  Units
>> TestArrayFill.fillByteArray   avgt   25   2078.700 ±   7.719  ns/op
>> TestArrayFill.fillIntArray    avgt   25  12371.497 ± 566.773  ns/op
>> TestArrayFill.fillShortArray  avgt   25   4132.439 ±  25.096  ns/op
>> TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
>> TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
>> TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
>>
>> After:
>> Benchmark                     Mode  Cnt      Score     Error  Units
>> TestArrayFill.fillByteArray   avgt   25   2080.382 ±   2.103  ns/op
>> TestArrayFill.fillIntArray    avgt   25  11997.621 ± 569.058  ns/op
>> TestArrayFill.fillShortArray  avgt   25   4309.035 ± 285.456  ns/op
>> TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
>> TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
>> TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
>>
>> Another advantage of using the stub routine is that the generated code size is
>> reduced.
>>
>> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core, and langtools::tier1 were
>> run and no new failures were found.
>
> Thanks,
> Pengfei
>