RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations

Fri Dec 24 10:34:44 UTC 2021

Hi all,

Happy Christmas Day!

We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
We have provided a reproducer in the JBS issue.
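
For readers without access to the JBS attachment, the kind of kernel under discussion can be sketched as follows. This is only an illustrative guess, not the actual reproducer: the class name `TwoDimArrayKernel`, the method `add`, and the 65x65 size (chosen to echo the iteration counts in the trace below) are all assumptions.

```java
public class TwoDimArrayKernel {
    // Hypothetical kernel: element-wise addition over the rows of double[][] arrays.
    // The inner loop over j is the candidate for auto-vectorization.
    static void add(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
    }

    public static void main(String[] args) {
        int rows = 65, cols = 65; // assumed sizes, not taken from the reproducer
        double[][] a = new double[rows][cols];
        double[][] b = new double[rows][cols];
        double[][] c = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                a[i][j] = i;
                b[i][j] = j;
            }
        }
        // Call the kernel repeatedly so that C2 gets a chance to compile it.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        System.out.println(c[rows - 1][cols - 1]);
    }
}
```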

Now let's discuss the reproducer.
The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.

In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
Unfortunately, it failed because the loop IR was too complicated [3], as the following trace shows.

SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
cl_exit 823 823  CountedLoopEnd  ===  738  822  [[ 907  682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
cl_exit->in(0) 738 738  IfTrue  ===  735  [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
lpt->_head 1267 1267  CountedLoop  ===  1267  1224  682  [[ 1267  1278  1283  1284  1288  1254  1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
    Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce
RangeCheck       Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce
Unroll 4         Loop: N1267/N682  counted [int,int),+2 (65 iters)  main rc  has_sfpt rce
Loop: N0/N0  has_sfpt
  Loop: N493/N463  limit_check profile_predicated predicated counted [0,int),+1 (65 iters)  sfpts={ 453 }
    Loop: N946/N966  counted [0,int),+1 (4 iters)  pre has_sfpt
    Loop: N1483/N682  counted [int,int),+4 (65 iters)  main rc  has_sfpt
    Loop: N857/N877  counted [int,int),+1 (4 iters)  post has_sfpt
PredicatesOff

Then C2 unrolled the loop with `unroll-factor=4` and applied some other optimizations, which actually simplified the loop IR.

Next came another round of loop unrolling analysis, in which C2 checked whether `future_unroll_cnt=8` [2] was acceptable for unrolling.
C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
But if we redo the slp analysis with `future_unroll_cnt=4` before returning false, it passes.

So the key idea is:

  slp analysis may fail because the loop IR is too complicated, especially during the early stages of loop unrolling analysis.
  But after several rounds of loop unrolling and other optimizations, the loop IR may become simple enough to pass the slp analysis.
  So C2 can try one more slp analysis instead of returning false immediately at [4].
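
The control flow of this idea can be sketched with a toy model. This is not HotSpot code: `LoopModel`, `slpAnalysis`, `policyUnroll`, `run`, the unroll limit of 4, and the "IR is simple enough once the loop has been unrolled 4 times" predicate are all hypothetical stand-ins, chosen only to mirror the sequence of events described above.

```java
import java.util.function.IntPredicate;

public class UnrollPolicySketch {
    // Toy stand-in for a counted loop; it only tracks what this sketch needs.
    static final class LoopModel {
        int slpMaxUnroll = 0;   // 0 means no slp analysis has passed yet
        int unrolledCount = 1;  // how many times the body has been unrolled
    }

    // Hypothetical slp analysis: records unrollCnt only if the IR is simple enough.
    static void slpAnalysis(LoopModel loop, int unrollCnt, IntPredicate irSimpleEnough) {
        if (irSimpleEnough.test(loop.unrolledCount)) {
            loop.slpMaxUnroll = unrollCnt;
        }
    }

    // Hypothetical unrolling policy. With retry enabled, a rejected
    // future unroll count triggers one more slp analysis at the current count.
    static boolean policyUnroll(LoopModel loop, int unrollLimit,
                                IntPredicate irSimpleEnough, boolean retry) {
        int futureUnrollCnt = loop.unrolledCount * 2;
        if (futureUnrollCnt > unrollLimit) {
            if (retry && loop.slpMaxUnroll == 0) {
                slpAnalysis(loop, loop.unrolledCount, irSimpleEnough);
            }
            return false;
        }
        slpAnalysis(loop, futureUnrollCnt, irSimpleEnough);
        return true;
    }

    // Drives the policy until it rejects further unrolling; returns slpMaxUnroll.
    static int run(boolean retry) {
        LoopModel loop = new LoopModel();
        loop.unrolledCount = 2;                    // the main loop starts with stride 2
        IntPredicate irSimpleEnough = n -> n >= 4; // assumed threshold, for illustration
        while (policyUnroll(loop, 4, irSimpleEnough, retry)) {
            loop.unrolledCount *= 2;               // "unroll" the loop
        }
        return loop.slpMaxUnroll;
    }

    public static void main(String[] args) {
        System.out.println("without retry: slpMaxUnroll = " + run(false));
        System.out.println("with retry:    slpMaxUnroll = " + run(true));
    }
}
```

In this model the first analysis (unroll count 4) fails on the still-complicated IR, the unroll count 8 is rejected, and only the retry variant ends with a non-zero `slpMaxUnroll`.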

We have observed up to a 1.7x performance improvement in our microbenchmarks.

![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png)

Testing:
  - tier1 through tier3 on Linux/x64: no regressions.

Thanks.
Best regards,
Jie

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
[3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
[4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910

-------------

Commit messages:
 - 8279258: Auto-vectorization enhancement for two-dimensional array operations

Changes: https://git.openjdk.java.net/jdk/pull/6933/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8279258
  Stats: 140 lines in 2 files changed: 136 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6933.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6933/head:pull/6933

PR: https://git.openjdk.java.net/jdk/pull/6933