RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible

Tue May 9 02:30:28 UTC 2023

On Fri, 5 May 2023 07:33:03 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171
>> 
>> I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All of the reductions that can be re-ordered are to extend from this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow for reordering of operations.
>> 
>> The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.
>> 
>> **Performance results**
>> I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.
>> 
>> I disabled `turbo-boost`.
>> Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
>> Full `avx512` support, including `avx512dq` required for `MulReductionVL`.
>> 
>> 
>> operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
>> ---------------------------------------------------------------
>> int add       2063   2085   660    530    415    283   |      |
>> int mul       2272   2257   1189   733    908    439   |      |
>> int min       2527   2520   2516   2579   2585   2542  | 1    |
>> int max       2548   2525   2551   2516   2515   2517  | 1    |
>> int and       2410   2414   602    480    353    263   |      |
>> int or        2149   2151   597    498    354    262   |      |
>> int xor       2059   2062   605    476    364    263   |      |
>> long add      1776   1790   2000   1000   1683   591   | 2    |
>> long mul      2135   2199   2185   2001   2176   1307  | 2    |
>> long min      1439   1424   1421   1420   1430   1427  | 3    |
>> long max      2299   2287   2303   2305   1433   1425  | 3    |
>> long and      1657   1667   2015   1003   1679   568   | 4    |
>> long or       1776   1783   2032   1009   1680   569   | 4    |
>> long xor      1834   1783   2012   1024   1679   570   | 4    |
>> float add     2779   2644   2633   2648   2632   2639  | 5    |
>> float mul     2779   2871   2810   2776   2732   2791  | 5    |
>> float min     2294   2620   1725   1286   872    672   |      |
>> float max     2371   2519   1697   1265   841    468   |      |
>> double add    2634   2636   2635   2650   2635   2648  | 5    |
>> double mul    3053   2955   2881   3030   2979   2927  | 5    |
>> double min    2364   2400   2439   2399   2486   2398  | 6    |
>> double max    2488   2459   2501   2451   2493   2498  | 6    |
>> 
>> Legend: `M` master, `P` with patch, `N` no superword reductions (`-XX:-SuperWordReductions`), `2` AVX2, `3` AVX512.
>> 
>> The lines without note show clear speedup as expected.
>> 
>> Notes:
>> 1. `int min/max`: bug [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673)
>> 2. `long add/mul`: without the patch, it seems that vectorization actually would be slower. Even now, only AVX512 really leads to a speedup. Note: `MulReductionVL` requires `avx512dq`.
>> 3. `long min/max`: `Math.max(long, long)` is currently not intrinsified [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513).
>> 4. `long and/or/xor`: without patch on AVX2, vectorization is slower. With patch, it is always faster now.
>> 5. `float/double add/mul`: IEEE requires linear reduction. This cannot be moved outside loop. Vectorization has no benefit in these examples.
>> 6. `double min/max`: bug [JDK-8300865](https://bugs.openjdk.org/browse/JDK-8300865).
>> 
>> **Testing**
>> 
>> I modified the reduction IR tests, so that they expect at most 2 Reduction nodes (one per main-loop, and optionally one for the vectorized post-loop). Before my patch, these IR tests would find many Reduction nodes, and would have failed. This is because after SuperWord, we unroll the loop multiple times, and so we clone the Reduction nodes inside the main loop.
>> 
>> Passes up to tier5 and stress-testing.
>> Performance testing did not show any regressions.
>> **TODO** can someone benchmark on `aarch64`?
>> 
>> **Discussion**
>> 
>> We should investigate if we can now allow reductions more eagerly, at least for `UnorderedReduction`, as the overhead is now much lower. @jatin-bhateja pointed to this:
>> https://github.com/openjdk/jdk/blob/941a7ac7dab243c6033a78880fd31faa803e62ab/src/hotspot/share/opto/superword.cpp#L2265
>> I filed [JDK-8307516](https://bugs.openjdk.org/browse/JDK-8307516).
>> 
>> So far, I did not work on `byte, char, short`, we can investigate this in the future.
>> 
>> FYI: I investigated if this may be helpful for the Vector API. As far as I can see, Reductions are only introduced with a vector-iunput, and the scalar-input is always the identity-element. This optimization here assumes that we have the Phi-loop going through the scalar-input. So I think this optimization here really only helps `SuperWord` for now.
>
> @jatin-bhateja @vnkozlov @sviswa7 I substantially reworked this RFE, and have it working now, and included your suggestions.
> 
> The lagorithm now sits in `PhaseIdealLoop::build_and_optimize` after `SuperWord`. It can now handle chains of `UnorderedReduction`, so that it is more robust agains unrolling.
> 
> The only thing missing for me is:
> 1. Benchmark on `aarch64`. @fg1417 Would you want to have a look at that?
> 2. Wait for the performance testing results.

Hi @eme64 , I tested your benchmark on neon and 128-bit sve machines, also with `2_000` warmup, `100_000` perf iterations and `16*1024` array length. I preprocessed raw data by dividing results on master. 

M: master
M-N: master with -XX:-SuperWordReductions
P: with your patch

128-bit-sve
--------------------------------------------
type	op	M	M-N	P
int	add	1	2.222	0.787
int	mul	1	0.931	0.465
int	min	1	1.000	0.999
int	max	1	0.999	0.999
int	and	1	1.928	0.691
int	or	1	1.866	0.685
int	xor	1	1.924	0.738
long	add	1	1.036	1.001
long	mul	1	0.983	1.001
long	min	1	1.001	0.999
long	max	1	1.002	0.996
long	and	1	1.037	1.001
long	or	1	1.017	1.002
long	xor	1	1.037	1.002
float	add	1	1.894	1.000
float	mul	1	0.973	1.000
float	min	1	2.926	0.758
float	max	1	2.925	0.758
double	add	1	1.472	1.005
double	mul	1	1.105	1.000
double	min	1	1.670	0.866
double	max	1	1.669	0.865

NEON
--------------------------------------------
type	op	M	M-N	P
int	add	1	1.991	0.892
int	mul	1	1.212	0.605
int	min	1	1.007	1.007
int	max	1	1.002	1.004
int	and	1	1.597	0.717
int	or	1	1.566	0.716
int	xor	1	1.594	0.715
long	add	1	1.001	1.000
long	mul	1	1.000	1.000
long	min	1	1.015	1.001
long	max	1	1.001	1.000
long	and	1	1.001	1.000
long	or	1	1.001	1.000
long	xor	1	1.001	1.000
float	add	1	1.000	1.000
float	mul	1	1.000	0.999
float	min	1	2.875	0.765
float	max	1	2.873	0.762
double	add	1	0.999	0.996
double	mul	1	1.001	0.996
double	min	1	1.607	0.862
double	max	1	1.609	0.865

We can see obvious uplift brought by your patch, and almost no regression. Since superword doesn't support 2-element long reduction, see https://github.com/openjdk/jdk/blob/d9052b946682d1c0f2629455d73fe4e6b95b29db/src/hotspot/share/opto/superword.cpp#L2314
no benefit on NEON/128-bit sve machines for long type is expected. Once I can access more than 128 bit sve machines, I can verify it later.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13056#issuecomment-1539298861