RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations
Jie Fu
jiefu at openjdk.java.net
Fri Dec 24 10:34:44 UTC 2021
Hi all,
Happy Christmas Day!
We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
We have attached a reproducer to the JBS issue.
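For readers who have not opened the JBS issue, the kind of loop in question looks roughly like the sketch below. This is not the actual reproducer: the class name `TwoDimArrayAdd`, the method `addInPlace`, and the array sizes are made up for illustration. The inner loop over `j` is the part C2 would be expected to vectorize.

```java
// Hypothetical sketch of a two-dimensional array kernel; NOT the reproducer from JBS.
final class TwoDimArrayAdd {
    // Adds b into a element-wise. The inner loop is the auto-vectorization candidate.
    static void addInPlace(double[][] a, double[][] b) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                a[i][j] += b[i][j];
            }
        }
    }

    public static void main(String[] args) {
        // Sizes are arbitrary assumptions, chosen only to make the method hot.
        double[][] a = new double[64][64];
        double[][] b = new double[64][64];
        for (double[] row : b) {
            java.util.Arrays.fill(row, 1.0);
        }
        // Call repeatedly so the JIT has a chance to compile the method with C2.
        for (int iter = 0; iter < 20_000; iter++) {
            addInPlace(a, b);
        }
        System.out.println(a[0][0]);
    }
}
```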
Now let's discuss the reproducer.
The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
Unfortunately, that analysis failed because the loop IR was too complicated [3], as the following trace shows.
SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
Loop: N0/N0 has_sfpt
Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 }
Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt
Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt
Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt
PredicatesOff
Then C2 unrolled the loop with `unroll-factor=4` and also performed some other optimizations, which simplified the loop IR.
Next came another round of loop unrolling analysis, in which C2 checked whether `future_unroll_cnt=8` [2] is OK for unrolling.
C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass.
So the key idea is:
slp analysis may fail because the loop IR is too complicated, especially during the early stages of loop unrolling analysis.
But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis.
So C2 can try one more slp analysis instead of returning false immediately here [4].
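To make the control flow concrete, here is a toy model of the idea written in Java. It is not HotSpot code: `UnrollPolicySketch`, `LoopState`, `policyUnroll`, and the `unrollLimit` parameter are all hypothetical names, the reason C2 rejects the larger unroll count is reduced to a simple limit check, and the slp analysis is passed in as a predicate. Only the shape of the decision (retry the analysis at the current unroll count before returning false) reflects the proposal.

```java
import java.util.function.IntPredicate;

// Toy model only; every name here is hypothetical and none of it is HotSpot code.
final class UnrollPolicySketch {
    // Stand-in for per-loop state. slpMaxUnroll == 0 means no slp analysis has passed yet.
    static final class LoopState {
        int slpMaxUnroll = 0;
        int unrolledCount = 1;
    }

    // Returns true if the loop should be unrolled once more.
    static boolean policyUnroll(LoopState loop, int unrollLimit, IntPredicate slpAnalysisPasses) {
        int futureUnrollCount = loop.unrolledCount * 2;
        if (futureUnrollCount > unrollLimit) {
            // Proposed behavior: before giving up, retry slp analysis at the
            // current unroll count if no earlier analysis has passed.
            if (loop.slpMaxUnroll == 0 && slpAnalysisPasses.test(loop.unrolledCount)) {
                loop.slpMaxUnroll = loop.unrolledCount;
            }
            return false;
        }
        // Regular path: analyze for the future unroll count. Unrolling may
        // proceed even when the analysis fails.
        if (slpAnalysisPasses.test(futureUnrollCount)) {
            loop.slpMaxUnroll = futureUnrollCount;
        }
        return true;
    }
}
```

In this model, a first call with a failing analysis still allows unrolling and leaves `slpMaxUnroll` at 0. A later call that hits the limit re-runs the analysis on the (by then simpler) loop and records a non-zero `slpMaxUnroll`, which is what lets vectorization proceed.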
We have observed up to a 1.7x performance improvement in our microbenchmarks.
![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png)
Testing:
- tier1 ~ tier3 on Linux/x64, no regression.
Thanks.
Best regards,
Jie
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
[3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
[4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
-------------
Commit messages:
- 8279258: Auto-vectorization enhancement for two-dimensional array operations
Changes: https://git.openjdk.java.net/jdk/pull/6933/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8279258
Stats: 140 lines in 2 files changed: 136 ins; 0 del; 4 mod
Patch: https://git.openjdk.java.net/jdk/pull/6933.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/6933/head:pull/6933
PR: https://git.openjdk.java.net/jdk/pull/6933