RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible [v5]
Emanuel Peter
epeter at openjdk.org
Wed May 10 11:45:38 UTC 2023
> https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171
>
> I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All reductions whose operations can be reordered extend this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow reordering of operations, so they are not `UnorderedReductionNode`s.
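
To make the reordering argument concrete, here is a minimal scalar sketch of the idea, not the C2 implementation. The class name, method names and `LANES` are hypothetical, and the inner loops over `l` only emulate what a vector instruction would do. It compares reducing inside the loop with accumulating lane-wise and reducing once after the loop. For `int add` both give the same result even on overflow, since wrap-around addition is associative and commutative.

```java
// Hypothetical illustration only; none of these names exist in the patch.
final class UnorderedReductionSketch {
    // Assumed vector width: 8 int lanes (as with a 256-bit vector).
    static final int LANES = 8;

    // Before: each "vector" iteration reduces all lanes into the scalar sum.
    static int reduceInsideLoop(int[] a) {
        int sum = 0;
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            int partial = 0;
            for (int l = 0; l < LANES; l++) {
                partial += a[i + l]; // horizontal reduction, paid every iteration
            }
            sum += partial;
        }
        for (; i < a.length; i++) {
            sum += a[i]; // scalar tail
        }
        return sum;
    }

    // After: the loop only does an element-wise add into a vector accumulator;
    // the horizontal reduction happens once, after the loop.
    static int reduceAfterLoop(int[] a) {
        int[] acc = new int[LANES]; // starts at the identity element 0
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            for (int l = 0; l < LANES; l++) {
                acc[l] += a[i + l]; // lane-wise accumulate
            }
        }
        int sum = 0;
        for (int l = 0; l < LANES; l++) {
            sum += acc[l]; // single reduction after the loop
        }
        for (; i < a.length; i++) {
            sum += a[i]; // scalar tail
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[1003];
        for (int i = 0; i < a.length; i++) {
            a[i] = i * 0x9E3779B9; // arbitrary values that overflow when summed
        }
        System.out.println(reduceInsideLoop(a) == reduceAfterLoop(a));
    }
}
```

The same transformation would not be valid for an operation where the order of application changes the result, which is why only the reorderable reductions are covered.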
>
> The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.
>
> **Performance results**
> I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.
>
> I disabled `turbo-boost`.
> Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
> Full `avx512` support, including `avx512dq` required for `MulReductionVL`.
>
>
> operation    M-N-2  M-N-3    M-2    M-3    P-2    P-3 | note |
> --------------------------------------------------------------
> int add       2063   2085    660    530    415    283 |      |
> int mul       2272   2257   1189    733    908    439 |      |
> int min       2527   2520   2516   2579   2585   2542 | 1    |
> int max       2548   2525   2551   2516   2515   2517 | 1    |
> int and       2410   2414    602    480    353    263 |      |
> int or        2149   2151    597    498    354    262 |      |
> int xor       2059   2062    605    476    364    263 |      |
> long add      1776   1790   2000   1000   1683    591 | 2    |
> long mul      2135   2199   2185   2001   2176   1307 | 2    |
> long min      1439   1424   1421   1420   1430   1427 | 3    |
> long max      2299   2287   2303   2305   1433   1425 | 3    |
> long and      1657   1667   2015   1003   1679    568 | 4    |
> long or       1776   1783   2032   1009   1680    569 | 4    |
> long xor      1834   1783   2012   1024   1679    570 | 4    |
> float add     2779   2644   2633   2648   2632   2639 | 5    |
> float mul     2779   2871   2810   2776   2732   2791 | 5    |
> float min     2294   2620   1725   1286    872    672 |      |
> float max     2371   2519   1697   1265    841    468 |      |
> double add    2634   2636   2635   2650   2635   2648 | 5    |
> double mul    3053   2955   2881   3030   2979   2927 | 5    |
> double min    2364   2400   2439   2399   2486   2398 | 6    |
> double max    2488   2459   2501   2451   2493   2498 | 6    |
>
> Legend: `M` master, `P` with patch, `N` no superword reductions (`-XX:-SuperWordReductions`), `2` AVX2, `3` AVX512.
>
> The lines without a note show a clear speedup, as expected.
>
> Notes:
> 1. `int min/max`: bug [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673)
> 2. `long add/mul`: without the patch, it seems that vectorization actually would be slower. Even now, only AVX512 really leads to a speedup. Note: `MulReductionVL` requires `avx512dq`.
> 3. `long min/max`: `Math.max(long, long)` is currently not intrinsified [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513).
> 4. `long and/or/xor`: without patch on AVX2, vectorization is slower. With patch, it is always faster now.
> 5. `float/double add/mul`: IEEE requires linear reduction. This cannot be moved outside loop. Vectorization has no benefit in these examples.
> 6. `double min/max`: bug [JDK-8300865](https://bugs.openjdk.org/browse/JDK-8300865).
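
As an illustration of note 5, here is a small sketch of why `float add` cannot be treated like the unordered reductions: floating-point addition is not associative, so splitting the sum across lanes can change the result. The class and method names are hypothetical, and the two-lane split is only a stand-in for what a vector accumulator would do.

```java
// Hypothetical illustration only; none of these names exist in the patch.
final class FloatReorderSketch {
    // Linear reduction: strictly left to right, as the scalar loop does it.
    static float sumLinear(float[] a) {
        float s = 0f;
        for (float x : a) {
            s += x;
        }
        return s;
    }

    // Reordered reduction: two "lanes" (even and odd indices), combined at the end.
    static float sumTwoLanes(float[] a) {
        float even = 0f;
        float odd = 0f;
        for (int i = 0; i < a.length; i++) {
            if ((i & 1) == 0) {
                even += a[i];
            } else {
                odd += a[i];
            }
        }
        return even + odd;
    }

    public static void main(String[] args) {
        // 1e8f + 1f rounds back to 1e8f, because adjacent floats near 1e8 are 8 apart.
        float[] a = {1e8f, 1f, -1e8f, 1f};
        System.out.println(sumLinear(a));   // prints 1.0
        System.out.println(sumTwoLanes(a)); // prints 2.0
    }
}
```

With `int` in place of `float`, both methods would always agree, which matches the split between the reorderable reductions and `float/double add/mul`.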
>
> **Testing**
>
> I modified the reduction IR tests, so that they expect at most 2 Reduction nodes (one per main-loop, and optionally one for the vectorized post-loop). Before my patch, these IR tests would find many Reduction nodes, and would have failed. This is because after SuperWord, we unroll the loop multiple times, and so we clone the Reduction nodes inside the main loop.
>
> Passes up to tier5 and stress-testing.
> Performance testing did not show any regressions.
> **TODO** can someone benchmark on `aarch64`?
>
> **Discussion**
>
> We should investigate if we can now allow reductions more eagerly, at least for `UnorderedReduction`, as the overhead is now much lower. @jatin-bhateja pointed to this:
> https://github.com/openjdk/jdk/blob/941a7ac7dab243c6033a78880fd31faa803e62ab/src/hotspot/share/opto/superword.cpp#L2265
> I filed [JDK-8307516](https://bugs.openjdk.org/browse/JDK-8307516).
>
> So far, I have not worked on `byte, char, short`. We can investigate these in the future.
>
> FYI: I investigated whether this may be helpful for the Vector API. As far as I can see, Reductions are only introduced with a vector-input, and the scalar-input is always the identity-element. The optimization here assumes that the Phi-loop goes through the scalar-input. So I think this optimization really only helps `SuperWord` for now.
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
use is_counted and is_innermost
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/13056/files
- new: https://git.openjdk.org/jdk/pull/13056/files/72fa58e0..31d977c2
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=13056&range=04
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=13056&range=03-04
Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/13056.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13056/head:pull/13056
PR: https://git.openjdk.org/jdk/pull/13056