RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v5]
Vladimir Kozlov
kvn at openjdk.java.net
Wed Jan 5 02:37:14 UTC 2022
On Wed, 5 Jan 2022 02:29:55 GMT, Jie Fu <jiefu at openjdk.org> wrote:
>> Hi all,
>>
>> Happy Christmas Day!
>>
>> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
>> And we have made a reproducer in the JBS.
>>
>> Now let's discuss the reproducer.
>> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
>>
>> In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
>> Unfortunately, it failed because the loop IR is too complicated [3], as the following output shows.
>>
>> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
>> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
>> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
>> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
>> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>> Loop: N0/N0 has_sfpt
>> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 }
>> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt
>> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt
>> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt
>> PredicatesOff
>>
>>
>> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation.
>>
>> Then comes the next round of loop unrolling analysis, in which C2 checks whether `future_unroll_cnt=8` [2] is OK for unrolling.
>> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
>> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass.
>>
>> So the key idea is:
>>
>> slp analysis may fail because the loop IR is too complicated, especially during the early stage of loop unrolling analysis.
>> But after several rounds of loop unrolling and other optimizations, the loop IR may become simple enough to pass the slp analysis.
>> So C2 can try one more slp analysis instead of returning false immediately here [4].
>>
>>
>> We have observed up to 1.7x performance improvement in our micro benchmarks.
>>
>> Testing:
>> - tier1 ~ tier3 on Linux/x64, no regression.
>>
>> Thanks.
>> Best regards,
>> Jie
>>
>>
>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
>> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
>> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
>> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
>
> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
>
> - Merge branch 'master' into JDK-8279258
> - Update the copyright year
> - Merge branch 'master' into JDK-8279258
> - Remove redundant UseSuperWord check
> - Address review comments
> - Merge branch 'master' into JDK-8279258
> - 8279258: Auto-vectorization enhancement for two-dimensional array operations
My understanding was that we do SLP unrolling analysis only after all RCE (range check elimination) and other loop optimizations are done, so that only unrolling is left. But this is not the case. As your output and my testing show, RCE still happens after we have tried SLP analysis:

RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce

Simply checking `is_unroll_only()` does not help currently:

// Only attempt slp analysis when user controls do not prohibit it
- if (LoopMaxUnroll > _local_loop_unroll_factor) {
+ if (cl->is_unroll_only() && (LoopMaxUnroll > _local_loop_unroll_factor)) {
// Once policy_slp_analysis succeeds, mark the loop with the
// maximal unroll factor so that we minimize analysis passes

because the `HasRangeChecks` flag is still set even after RCE:

Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt

Can you investigate why the `HasRangeChecks` flag is still set?
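
For readers following along without the JBS attachment, here is a minimal sketch of the kind of two-dimensional array loop nest this thread is about. It is not the actual reproducer: the class name `TwoDimAddSketch`, the helper `filled`, the array sizes, and the iteration count are all assumptions chosen for illustration.

```java
// Hypothetical stand-in for a two-dimensional array kernel; the real
// reproducer attached to the JBS issue may look different.
final class TwoDimAddSketch {

    // Element-wise c[i][j] = a[i][j] + b[i][j]. The inner loop over j is
    // the counted loop that C2 would be expected to unroll and vectorize.
    static void add(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
    }

    // Builds a rows x cols matrix with m[i][j] = scale * (i * cols + j).
    static double[][] filled(int rows, int cols, double scale) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                m[i][j] = scale * (i * cols + j);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int rows = 64, cols = 256; // sizes are arbitrary assumptions
        double[][] a = filled(rows, cols, 1.0);
        double[][] b = filled(rows, cols, 2.0);
        double[][] c = new double[rows][cols];
        // Call the kernel often enough that it is likely to be compiled by C2.
        for (int iter = 0; iter < 10_000; iter++) {
            add(a, b, c);
        }
        System.out.println(c[rows - 1][cols - 1]);
    }
}
```

On a debug build of HotSpot, running such a program with `-XX:+TraceLoopOpts` should print loop-optimization lines of the same form as the ones quoted above, which makes it possible to see whether the `rc` marker is still reported for the main loop after RCE.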
-------------
PR: https://git.openjdk.java.net/jdk/pull/6933