RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible

Fri May 5 07:39:17 UTC 2023

https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171

I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All of the reductions that can be re-ordered are to extend from this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow for reordering of operations.

The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.

**Performance results**
I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.

I disabled `turbo-boost`.
Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
Full `avx512` support, including `avx512dq` required for `MulReductionVL`.

operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
---------------------------------------------------------------
int add       2063   2085   660    530    415    283   |      |
int mul       2272   2257   1189   733    908    439   |      |
int min       2527   2520   2516   2579   2585   2542  | 1    |
int max       2548   2525   2551   2516   2515   2517  | 1    |
int and       2410   2414   602    480    353    263   |      |
int or        2149   2151   597    498    354    262   |      |
int xor       2059   2062   605    476    364    263   |      |
long add      1776   1790   2000   1000   1683   591   | 2    |
long mul      2135   2199   2185   2001   2176   1307  | 2    |
long min      1439   1424   1421   1420   1430   1427  | 3    |
long max      2299   2287   2303   2305   1433   1425  | 3    |
long and      1657   1667   2015   1003   1679   568   | 4    |
long or       1776   1783   2032   1009   1680   569   | 4    |
long xor      1834   1783   2012   1024   1679   570   | 4    |
float add     2779   2644   2633   2648   2632   2639  | 5    |
float mul     2779   2871   2810   2776   2732   2791  | 5    |
float min     2294   2620   1725   1286   872    672   |      |
float max     2371   2519   1697   1265   841    468   |      |
double add    2634   2636   2635   2650   2635   2648  | 5    |
double mul    3053   2955   2881   3030   2979   2927  | 5    |
double min    2364   2400   2439   2399   2486   2398  | 6    |
double max    2488   2459   2501   2451   2493   2498  | 6    |

Legend: `M` master, `P` with patch, `N` no superword reductions (`-XX:-SuperWordReductions`), `2` AVX2, `3` AVX512.

The lines without note show clear speedup as expected.

Notes:
1. `int min/max`: bug [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673)
2. `long add/mul`: without the patch, it seems that vectorization actually would be slower. Even now, only AVX512 really leads to a speedup. Note: `MulReductionVL` requires `avx512dq`.
3. `long min/max`: `Math.max(long, long)` is currently not intrinsified [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513).
4. `long and/or/xor`: without patch on AVX2, vectorization is slower. With patch, it is always faster now.
5. `float/double add/mul`: IEEE requires linear reduction. This cannot be moved outside loop. Vectorization has no benefit in these examples.
6. `double min/max`: bug [JDK-8300865](https://bugs.openjdk.org/browse/JDK-8300865).

**Testing**

I modified the reduction IR tests, so that they expect at most 2 Reduction nodes (one per main-loop, and optionally one for the vectorized post-loop). Before my patch, these IR tests would find many Reduction nodes, and would have failed. This is because after SuperWord, we unroll the loop multiple times, and so we clone the Reduction nodes inside the main loop.

Passes up to tier5 and stress-testing.
**TODO report performance testing (running)**
**TODO** can someone benchmark on `aarch64`?

**Discussion**

We should investigate if we can now allow reductions more eagerly, at least for `UnorderedReduction`, as the overhead is now much lower. @jatin-bhateja pointed to this:
https://github.com/openjdk/jdk/blob/941a7ac7dab243c6033a78880fd31faa803e62ab/src/hotspot/share/opto/superword.cpp#L2265
I filed [JDK-8307516](https://bugs.openjdk.org/browse/JDK-8307516).

So far, I did not work on `byte, char, short`, we can investigate this in the future.

FYI: I investigated if this may be helpful for the Vector API. As far as I can see, Reductions are only introduced with a vector-iunput, and the scalar-input is always the identity-element. This optimization here assumes that we have the Phi-loop going through the scalar-input. So I think this optimization here really only helps `SuperWord` for now.

-------------

Commit messages:
 - Merge branch 'master' into JDK-8302652
 - small bug fix
 - generalized the algorithm to handle a chain of UnorderedReductions
 - Moved move_unordered_reduction_out_of_loop from SuperWord to PhaseIdealLoop
 - neutral -> identity element
 - Moved code from Ideal to SuperWord
 - Vladimir's suggestions for ReductionPerf.java
 - pushed updated ReductionPerf.java
 - added IR rules to validate reduced use of Reduce node
 - fix for and, or, xor, min, max
 - ... and 2 more: https://git.openjdk.org/jdk/compare/ccf91f88...cc9e7e8e

Changes: https://git.openjdk.org/jdk/pull/13056/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13056&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8302652
  Stats: 831 lines in 14 files changed: 627 ins; 27 del; 177 mod
  Patch: https://git.openjdk.org/jdk/pull/13056.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13056/head:pull/13056

PR: https://git.openjdk.org/jdk/pull/13056