RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations
Nils Eliasson
neliasso at openjdk.java.net
Mon Dec 27 09:57:14 UTC 2021
On Fri, 24 Dec 2021 10:26:30 GMT, Jie Fu <jiefu at openjdk.org> wrote:
> Hi all,
>
> Happy Christmas Day!
>
> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
> And we have added a reproducer to the JBS issue.
>
> Now let's discuss the reproducer.
> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
>
> In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
> Unfortunately, it failed because the loop IR was too complicated [3], as the following output shows.
>
> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
> Loop: N0/N0 has_sfpt
> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 }
> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt
> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt
> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt
> PredicatesOff
>
>
> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other optimizations, which actually simplified the loop IR representation.
>
> Then comes the next round of loop unrolling analysis, in which C2 checks whether `future_unroll_cnt=8` [2] is OK for unrolling.
> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass.
>
> So the key idea is:
>
> slp analysis may fail because the loop IR is too complicated, especially during the early stages of loop unrolling analysis.
> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis.
> So C2 can try one more slp analysis instead of returning false immediately here [4].
>
>
> We have observed up to 1.7x performance improvement in our microbenchmarks.
>
> 
>
> Testing:
> - tier1 ~ tier3 on Linux/x64, no regression.
>
> Thanks.
> Best regards,
> Jie
>
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
Hi Jie,
I hope you have a good holiday too!
Nice find, and a straightforward fix too. I have one comment in the code.
It's excellent that you have added a microbenchmark. I would also like a small regression test that fails quickly if this breaks in the future. Perhaps something using the IR Testing framework.
Best regards,
Nils Eliasson
src/hotspot/share/opto/loopTransform.cpp line 913:
> 911: 1.2 * cl->node_count_before_unroll() < (double)_body.size()) {
> 912: if (UseSuperWord && (cl->slp_max_unroll() == 0) &&
> 913: (cl->unrolled_count() - 1) * (100.0 / LoopPercentProfileLimit) <= cl->profile_trip_cnt()) {
On lines 911 and 913 this is repeated:
"(X - 1) * (100.0 / LoopPercentProfileLimit) > cl->profile_trip_cnt()"
Please replace that with a method.
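A minimal sketch of what such a method could look like. The name `exceeds_profile_trip_limit` and its parameter list are hypothetical and only serve to keep the example self-contained; inside HotSpot the method would more likely read `LoopPercentProfileLimit` and `cl->profile_trip_cnt()` directly.

```cpp
// Hypothetical helper, not existing HotSpot code: name, signature and
// placement are assumptions made for illustration.
//
// Returns true when an unroll count of `unroll_cnt` is too large for the
// profiled trip count, i.e. when
//   (unroll_cnt - 1) * (100.0 / limit) > profile_trip_cnt
// `loop_percent_profile_limit` stands in for the LoopPercentProfileLimit flag.
static bool exceeds_profile_trip_limit(int unroll_cnt,
                                       double profile_trip_cnt,
                                       int loop_percent_profile_limit) {
  return (unroll_cnt - 1) * (100.0 / loop_percent_profile_limit) > profile_trip_cnt;
}
```

The `<=` form on line 913 would then become the negation of this call with `cl->unrolled_count()` as the first argument. With purely illustrative values (a limit of 10 and a profiled trip count of 65), a count of 8 exceeds the limit while a count of 4 does not.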
-------------
PR: https://git.openjdk.java.net/jdk/pull/6933