RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2]
Nils Eliasson
neliasso at openjdk.java.net
Thu Dec 30 09:43:22 UTC 2021
On Mon, 27 Dec 2021 14:41:58 GMT, Jie Fu <jiefu at openjdk.org> wrote:
>> Hi all,
>>
>> Happy Christmas Day!
>>
>> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
>> We have provided a reproducer in the JBS issue.
>>
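To make the discussion concrete, here is a minimal sketch of the kind of loop being described: an element-wise operation over a two-dimensional `double` array, where the inner loop is the candidate for auto-vectorization. This is not the reproducer from the JBS issue; the class name `TwoDimAdd`, the helper `filled`, and the array sizes are assumptions chosen only for illustration.

```java
// Hypothetical illustration only; not the reproducer attached to the JBS issue.
class TwoDimAdd {
    // Element-wise add over two-dimensional arrays. The inner loop over j
    // walks one row at a time and is the loop C2 would try to vectorize.
    static void add(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
    }

    // Builds a rows x cols array with every element set to v.
    static double[][] filled(int rows, int cols, double v) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            java.util.Arrays.fill(row, v);
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] a = filled(64, 64, 1.0);
        double[][] b = filled(64, 64, 2.0);
        double[][] c = new double[64][64];
        // Call the kernel repeatedly so the JIT has a chance to compile it.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        System.out.println(c[0][0]);
    }
}
```

Whether a loop of this shape actually hits the failure described below depends on the JDK build and on the IR that C2 produces for it, so treat it as a starting point for experiments rather than a guaranteed trigger.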
>> Now let's discuss the reproducer.
>> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
>>
>> In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
>> Unfortunately, it failed because the loop IR was too complicated [3], as shown in the following trace.
>>
>> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
>> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
>> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
>> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
>> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>> Loop: N0/N0 has_sfpt
>> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 }
>> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt
>> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt
>> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt
>> PredicatesOff
>>
>>
>> Then C2 unrolled the loop with `unroll-factor=4` and applied some other optimizations, which actually simplified the loop IR.
>>
>> Next comes another round of loop unrolling analysis, in which C2 checks whether `future_unroll_cnt=8` [2] is OK for unrolling.
>> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
>> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass.
>>
>> So the key idea is:
>>
>> slp analysis may fail because the loop IR is too complicated, especially during the early stages of loop unrolling analysis.
>> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis.
>> So C2 can try one more slp analysis instead of returning false immediately here [4].
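The key idea can be modeled outside HotSpot with a small toy policy. The sketch below is not the C2 code and not the patch itself: `SlpRetrySketch`, `LoopState`, `policyUnroll`, the `unrollLimit` parameter, and the `slpPasses` predicate are all hypothetical names, and the reason a larger unroll count is rejected is reduced to a simple limit check as an assumption. It only shows the control flow: when the next unroll count is rejected and no slp analysis has passed so far, try the analysis once more with the current unroll count before returning false.

```java
import java.util.function.IntPredicate;

// Toy model of the retry idea; all names here are hypothetical.
final class SlpRetrySketch {
    // slpMaxUnroll == 0 means no slp analysis has passed for this loop yet.
    static final class LoopState {
        int slpMaxUnroll = 0;
    }

    // Returns true if unrolling to futureUnrollCnt is accepted.
    // slpPasses stands in for the slp analysis on the loop IR as it looks
    // in the current round; its argument is the unroll count being analyzed.
    static boolean policyUnroll(LoopState loop, int currentUnrollCnt, int futureUnrollCnt,
                                int unrollLimit, boolean retryOnReject, IntPredicate slpPasses) {
        if (futureUnrollCnt > unrollLimit) {
            // Rejected. Without the retry, slpMaxUnroll would stay at 0 here.
            if (retryOnReject && loop.slpMaxUnroll == 0 && slpPasses.test(currentUnrollCnt)) {
                loop.slpMaxUnroll = currentUnrollCnt;
            }
            return false;
        }
        // Accepted. Record the unroll count only if the analysis passes.
        if (slpPasses.test(futureUnrollCnt)) {
            loop.slpMaxUnroll = futureUnrollCnt;
        }
        return true;
    }

    public static void main(String[] args) {
        LoopState loop = new LoopState();
        // Round 1: unrolling to 4 is accepted, but the analysis fails on the complicated IR.
        policyUnroll(loop, 1, 4, 4, true, n -> false);
        // Round 2: unrolling to 8 is rejected; the retry analyzes the simplified IR with 4.
        policyUnroll(loop, 4, 8, 4, true, n -> true);
        System.out.println(loop.slpMaxUnroll);
    }
}
```

With `retryOnReject` set to false, the same two rounds leave `slpMaxUnroll` at 0, which corresponds to the state in which vectorization is skipped.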
>>
>>
>> We have observed up to a 1.7x performance improvement in our microbenchmarks.
>>
>> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png)
>>
>> Testing:
>> - tier1 ~ tier3 on Linux/x64, no regression.
>>
>> Thanks.
>> Best regards,
>> Jie
>>
>>
>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
>> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
>> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
>> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
>
> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Address review comments
> - Merge branch 'master' into JDK-8279258
> - 8279258: Auto-vectorization enhancement for two-dimensional array operations
Yes. Looks good!
-------------
Marked as reviewed by neliasso (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/6933