RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible

Fri May 5 07:39:25 UTC 2023

On Thu, 16 Mar 2023 07:29:44 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171
> 
> I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All of the reductions that can be re-ordered are to extend from this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow for reordering of operations.
> 
> The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.
> 
> **Performance results**
> I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.
> 
> I disabled `turbo-boost`.
> Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
> Full `avx512` support, including `avx512dq` required for `MulReductionVL`.
> 
> 
> operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
> ---------------------------------------------------------------
> int add       2063   2085   660    530    415    283   |      |
> int mul       2272   2257   1189   733    908    439   |      |
> int min       2527   2520   2516   2579   2585   2542  | 1    |
> int max       2548   2525   2551   2516   2515   2517  | 1    |
> int and       2410   2414   602    480    353    263   |      |
> int or        2149   2151   597    498    354    262   |      |
> int xor       2059   2062   605    476    364    263   |      |
> long add      1776   1790   2000   1000   1683   591   | 2    |
> long mul      2135   2199   2185   2001   2176   1307  | 2    |
> long min      1439   1424   1421   1420   1430   1427  | 3    |
> long max      2299   2287   2303   2305   1433   1425  | 3    |
> long and      1657   1667   2015   1003   1679   568   | 4    |
> long or       1776   1783   2032   1009   1680   569   | 4    |
> long xor      1834   1783   2012   1024   1679   570   | 4    |
> float add     2779   2644   2633   2648   2632   2639  | 5    |
> float mul     2779   2871   2810   2776   2732   2791  | 5    |
> float min     2294   2620   1725   1286   872    672   |      |
> float max     2371   2519   1697   1265   841    468   |      |
> double add    2634   2636   2635   2650   2635   2648  | 5    |
> double mul    3053   2955   2881   3030   2979   2927  | 5    |
> double min    2364   2400   2439   2399   2486   2398  | 6    |
> double max    2488   2459   2501   2451   2493   2498  | 6    |
> 
> Legend: `M` master, `P` with patch, `N` no superword reductions (`-XX:-SuperWordReductions`), `2` AVX2, `3` AVX512.
> 
> The lines without note show clear speedup as expected.
> 
> Notes:
> 1. `int min/max`: bug [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673)
> 2. `long add/mul`: without the patch, it seems that vectorization actually would be slower. Even now, only AVX512 really leads to a speedup. Note: `MulReductionVL` requires `avx512dq`.
> 3. `long min/max`: `Math.max(long, long)` is currently not intrinsified [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513).
> 4. `long and/or/xor`: without patch on AVX2, vectorization is slower. With patch, it is always faster now.
> 5. `float/double add/mul`: IEEE requires linear reduction. This cannot be moved outside loop. Vectorization has no benefit in these examples.
> 6. `double min/max`: bug [JDK-8300865](https://bugs.openjdk.org/browse/JDK-8300865).
> 
> **Testing**
> 
> I modified the reduction IR tests, so that they expect at most 2 Reduction nodes (one per main-loop, and optionally one for the vectorized post-loop). Before my patch, these IR tests would find many Reduction nodes, and would have failed. This is because after SuperWord, we unroll the loop multiple times, and so we clone the Reduction nodes inside the main loop.
> 
> Passes up to tier5 and stress-testing.
> **TODO report performance testing (running)**
> **TODO** can someone benchmark on `aarch64`?
> 
> **Discussion**
> 
> We should investigate if we can now allow reductions more eagerly, at least for `UnorderedReduction`, as the overhead is now much lower. @jatin-bhateja pointed to this:
> https://github.com/openjdk/jdk/blob/941a7ac7dab243c6033a78880fd31faa803e62ab/src/hotspot/share/opto/superword.cpp#L2265
> I filed [JDK-8307516](https://bugs.openjdk.org/browse/JDK-8307516).
> 
> So far, I did not work on `byte, char, short`, we can investigate this in the future.
> 
> FYI: I investigated if this may be helpful for the Vector API. As far as I can see, Reductions are only introduced with a vector-iunput, and the scalar-input is always the identity-element. This optimization here assumes that we have the Phi-loop going through the scalar-input. So I think this optimization here really only helps `SuperWord` for now.

@jatin-bhateja @sviswa7
What do you think about the performance numbers I measured? Do they make sense to you?

A few questions:
 - `long min/max`: Why do we require `avx512vlbwdq` in `Matcher::match_rule_supported_vector`? Would `avx512f` not be sufficient? `C2_MacroAssembler::reduceL` leads me to `vextracti64x4` (that should only require `avx512f`) and `reduce_operation_256` (where `vpminsq` only requires `avx2` via assert).
 - What do you think about the `double min/max` performance? What do you think could be the reason it is not similar to the behavior of `float min/max`?

@jatin-bhateja @vnkozlov @sviswa7 I substantially reworked this RFE, and have it working now, and included your suggestions.

The lagorithm now sits in `PhaseIdealLoop::build_and_optimize` after `SuperWord`. It can now handle chains of `UnorderedReduction`, so that it is more robust agains unrolling.

The only thing missing for me is:
1. Benchmark on `aarch64`. @fg1417 Would you want to have a look at that?
2. Wait for the performance testing results.

src/hotspot/share/opto/superword.cpp line 2670:

> 2668:           const Type *arith_type = n->bottom_type();
> 2669:           vn = ReductionNode::make(opc, nullptr, in1, in2, arith_type->basic_type());
> 2670:           if (vn->is_UnorderedReduction()) {

@jatin-bhateja I want to check if `vn` is a `UnorderedReduction`, so I want a `bool` answer. If I ask for `isa_UnorderdReduction()`, I would get a `UnorderedReductionNode*`, and `nullptr` if it is not a `UnorderedReduction`.
Maybe I did not understand your suggestion.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13056#issuecomment-1479124966
PR Comment: https://git.openjdk.org/jdk/pull/13056#issuecomment-1535844780
PR Review Comment: https://git.openjdk.org/jdk/pull/13056#discussion_r1158362742