RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v5]

Wed Jan 5 02:37:14 UTC 2022

On Wed, 5 Jan 2022 02:29:55 GMT, Jie Fu <jiefu at openjdk.org> wrote:

>> Hi all,
>> 
>> Happy Christmas Day!
>> 
>> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
>> We have attached a reproducer in JBS.
>> 
>> Now let's discuss the reproducer.
>> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
>> 
>> In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
>> Unfortunately, it failed because the loop IR was too complicated [3], as the following output shows.
>> 
>> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
>> cl_exit 823 823  CountedLoopEnd  ===  738  822  [[ 907  682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
>> cl_exit->in(0) 738 738  IfTrue  ===  735  [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
>> lpt->_head 1267 1267  CountedLoop  ===  1267  1224  682  [[ 1267  1278  1283  1284  1288  1254  1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
>>     Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce
>> RangeCheck       Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce
>> Unroll 4         Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce
>> Loop: N0/N0  has_sfpt
>>   Loop: N493/N463  limit_check profile_predicated predicated counted [0,int),+1 (65 iters)  sfpts={ 453 }
>>     Loop: N946/N966  counted [0,int),+1 (4 iters)  pre has_sfpt
>>     Loop: N1483/N682  counted [int,int),+4 (65 iters)  main rc  has_sfpt
>>     Loop: N857/N877  counted [int,int),+1 (4 iters)  post has_sfpt
>> PredicatesOff
>> 
>> 
>> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation.
>> 
>> Then comes the next round of loop unrolling analysis, in which C2 checks whether `future_unroll_cnt=8` [2] is OK for unrolling.
>> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
>> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass.
>> 
>> So the key idea is:
>> 
>>   slp analysis may fail because the loop IR is too complicated, especially during the early stages of loop unrolling analysis.
>>   But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis.
>>   So C2 can try one more slp analysis instead of returning false immediately here [4].
>> 
>> 
>> We have observed up to a 1.7x performance improvement in our microbenchmarks.
>> 
>> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png)
>> 
>> Testing:
>>   - tier1 ~ tier3 on Linux/x64, no regression.
>> 
>> Thanks.
>> Best regards,
>> Jie
>> 
>> 
>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
>> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
>> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
>> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
>
> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
> 
>  - Merge branch 'master' into JDK-8279258
>  - Update the copyright year
>  - Merge branch 'master' into JDK-8279258
>  - Remove redundant UseSuperWord check
>  - Address review comments
>  - Merge branch 'master' into JDK-8279258
>  - 8279258: Auto-vectorization enhancement for two-dimensional array operations
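
To make the key idea in the quoted message concrete, here is a toy model in Java. It is only a sketch under stated assumptions, not the C2 code: `ToyLoop`, `slpAnalysis`, `policyUnroll`, `run` and the `MAX_UNROLL` cap are all hypothetical names, and the real reason C2 rejects `future_unroll_cnt=8` is not stated in this thread. The model just shows how returning false immediately leaves the recorded unroll factor at 0, while one more analysis before returning can record 4.

```java
final class SlpRetrySketch {
    // Toy stand-in for a counted loop; this is not a HotSpot type.
    static final class ToyLoop {
        int unrollFactor = 2;       // assumed starting point for the toy model
        int slpMaxUnroll = 0;       // 0 means no slp analysis has passed yet
        boolean simplified = false; // set once unrolling/other opts cleaned up the IR
    }

    // Hypothetical cap that makes the second round reject a factor of 8.
    static final int MAX_UNROLL = 4;

    // Toy slp analysis: passes only once the loop IR is "simple enough".
    static boolean slpAnalysis(ToyLoop loop, int unrollCnt) {
        if (!loop.simplified) {
            return false;           // models "loop too complicated"
        }
        loop.slpMaxUnroll = unrollCnt;
        return true;
    }

    // Toy unroll policy: returns true if the loop should be unrolled again.
    static boolean policyUnroll(ToyLoop loop, boolean retryOnReject) {
        int futureUnrollCnt = loop.unrollFactor * 2;
        if (futureUnrollCnt > MAX_UNROLL) {
            // Proposed behaviour: try one more analysis at the current
            // factor before giving up, if none has passed so far.
            if (retryOnReject && loop.slpMaxUnroll == 0) {
                slpAnalysis(loop, loop.unrollFactor);
            }
            return false;
        }
        slpAnalysis(loop, futureUnrollCnt); // may fail on the first round
        return true;
    }

    // Runs the unroll rounds and returns the recorded slp unroll factor.
    static int run(boolean retryOnReject) {
        ToyLoop loop = new ToyLoop();
        while (policyUnroll(loop, retryOnReject)) {
            loop.unrollFactor *= 2;
            loop.simplified = true; // unrolling plus other opts simplify the IR
        }
        return loop.slpMaxUnroll;
    }
}
```

In this model `run(false)` ends with a recorded factor of 0 (no vectorization), while `run(true)` ends with 4, which mirrors the behaviour described for the reproducer.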

My understanding was that we do SLP unrolling analysis only after we have done all RCE (range check elimination) and other loop optimizations, so that only unrolling is left. But this is not the case. As your output and my testing show, RCE still happens after we have tried SLP analysis:

RangeCheck       Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce

Simply checking `is_unroll_only()` does not help currently:

     // Only attempt slp analysis when user controls do not prohibit it
-    if (LoopMaxUnroll > _local_loop_unroll_factor) {
+    if (cl->is_unroll_only() && (LoopMaxUnroll > _local_loop_unroll_factor)) {
       // Once policy_slp_analysis succeeds, mark the loop with the
       // maximal unroll factor so that we minimize analysis passes

because the `HasRangeChecks` flag is still set even after RCE:

    Loop: N1483/N682  counted [int,int),+4 (65 iters)  main rc  has_sfpt

Can you investigate why the `HasRangeChecks` flag is still set?
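
For anyone following along without the JBS attachment, a loop of the kind discussed here might look like the sketch below. This is an assumption on my side, not the actual reproducer: the class name `DoubleArray2Sketch`, the array sizes and the warm-up count are all made up for illustration.

```java
final class DoubleArray2Sketch {
    // Hypothetical sizes; the real reproducer may use different ones.
    static final int ROWS = 64;
    static final int COLS = 64;

    // Element-wise addition over two-dimensional arrays. The inner loop
    // over j is the candidate for auto-vectorization.
    static void add(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                a[i][j] = b[i][j] + c[i][j];
            }
        }
    }

    public static void main(String[] args) {
        double[][] a = new double[ROWS][COLS];
        double[][] b = new double[ROWS][COLS];
        double[][] c = new double[ROWS][COLS];
        for (int i = 0; i < ROWS; i++) {
            for (int j = 0; j < COLS; j++) {
                b[i][j] = i;
                c[i][j] = j;
            }
        }
        // Call the method often enough that C2 is likely to compile it.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        System.out.println(a[3][5]);
    }
}
```

With a debug build, running something like this with `-XX:+TraceLoopOpts` should print loop tree output similar to the lines quoted above, although the node numbers and iteration counts will differ.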

-------------

PR: https://git.openjdk.java.net/jdk/pull/6933