RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v3]
Jie Fu
jiefu at openjdk.java.net
Thu Dec 30 23:25:46 UTC 2021
> Hi all,
>
> Happy Christmas Day!
>
> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
> And we have made a reproducer in the JBS.
>
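The JBS reproducer itself is not quoted in this message. As a rough illustration only, a two-dimensional array kernel of the kind being discussed might look like the sketch below. The class and method names are hypothetical, and this is not the actual `DoubleArray2::test` code.

```java
public class TwoDimArrayAdd {
    // Hypothetical kernel: adds b into a element-wise.
    // The inner loop over j is the candidate for auto-vectorization.
    static void add(double[][] a, double[][] b) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                a[i][j] += b[i][j];
            }
        }
    }

    public static void main(String[] args) {
        double[][] a = new double[64][64];
        double[][] b = new double[64][64];
        for (int i = 0; i < 64; i++) {
            for (int j = 0; j < 64; j++) {
                a[i][j] = i;
                b[i][j] = j;
            }
        }
        // Call the kernel repeatedly so a JIT compiler has a chance to compile it.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b);
        }
        System.out.println(a[1][2]);
    }
}
```
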
> Now let's discuss the reproducer.
> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
>
> As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2].
> Unfortunately, it failed because the loop IR was too complicated [3], as shown below.
>
> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
> Loop: N0/N0 has_sfpt
> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 }
> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt
> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt
> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt
> PredicatesOff
>
>
> Then C2 unrolled the loop with `unroll-factor=4` and applied some other optimizations, which simplified the loop IR.
>
> Then comes the next round of loop unrolling analysis, in which C2 checks whether `future_unroll_cnt=8` [2] is OK for unrolling.
> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass.
>
> So the key idea is:
>
> slp analysis may fail because the loop IR is too complicated, especially during the early stages of loop unrolling analysis.
> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis.
> So C2 can try one more slp analysis instead of returning false immediately here [4].
>
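The control flow described above can be modeled with a small toy program. This is only a sketch under stated assumptions, not the HotSpot code: `SlpRetrySketch`, `policyUnroll`, `unrollLimit`, and the boolean `irSimpleEnough` are hypothetical stand-ins, and the real criterion by which C2 rejects `future_unroll_cnt=8` is not modeled.

```java
public class SlpRetrySketch {
    // Plays the role of cl->slp_max_unroll(): 0 means no slp analysis has passed yet.
    int slpMaxUnroll = 0;

    // Hypothetical stand-in for the slp analysis: it passes only when the
    // loop IR is simple enough, and then records the analyzed unroll count.
    void slpAnalysis(int unrollCnt, boolean irSimpleEnough) {
        if (irSimpleEnough) {
            slpMaxUnroll = unrollCnt;
        }
    }

    // Hypothetical unrolling policy. Returns true if the loop should be unrolled again.
    // unrollLimit is an assumed placeholder for whatever makes C2 reject the next factor.
    boolean policyUnroll(int currentUnrollCnt, int unrollLimit,
                         boolean irSimpleEnough, boolean retryBeforeReject) {
        int futureUnrollCnt = currentUnrollCnt * 2;
        if (futureUnrollCnt > unrollLimit) {
            // Proposed idea: try the analysis once more at the current factor
            // instead of returning false immediately.
            if (retryBeforeReject && slpMaxUnroll == 0) {
                slpAnalysis(currentUnrollCnt, irSimpleEnough);
            }
            return false;
        }
        slpAnalysis(futureUnrollCnt, irSimpleEnough);
        return true;
    }

    // Replays the two rounds from the description and returns the final slpMaxUnroll.
    static int run(boolean retryBeforeReject) {
        SlpRetrySketch loop = new SlpRetrySketch();
        // Round 1: future count 4, IR still too complicated, analysis fails, loop is unrolled.
        loop.policyUnroll(2, 4, false, retryBeforeReject);
        // Round 2: future count 8 is rejected; the IR is now simple enough.
        loop.policyUnroll(4, 4, true, retryBeforeReject);
        return loop.slpMaxUnroll;
    }

    public static void main(String[] args) {
        System.out.println("without retry: " + run(false));
        System.out.println("with retry:    " + run(true));
    }
}
```

In this toy model the run without the retry ends with the recorded value still at 0, which corresponds to the `cl->slp_max_unroll() == 0` condition that blocks vectorization, while the run with the retry records 4.
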
>
> We have observed up to a 1.7x performance improvement in our micro benchmarks.
>
> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png)
>
> Testing:
> - tier1 ~ tier3 on Linux/x64, no regression.
>
> Thanks.
> Best regards,
> Jie
>
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
Jie Fu has updated the pull request incrementally with one additional commit since the last revision:
Remove redundant UseSuperWord check
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/6933/files
- new: https://git.openjdk.java.net/jdk/pull/6933/files/1a9b4c84..3af74828
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=02
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=01-02
Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/6933.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/6933/head:pull/6933
PR: https://git.openjdk.java.net/jdk/pull/6933