RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible [v7]

Wed May 17 06:13:56 UTC 2023

On Mon, 15 May 2023 11:05:06 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171
>> 
>> I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All of the reductions that can be re-ordered are to extend from this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow for reordering of operations.
>> 
>> The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.
>> 
>> **Performance results**
>> I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.
>> 
>> I disabled `turbo-boost`.
>> Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
>> Full `avx512` support, including `avx512dq` required for `MulReductionVL`.
>> 
>> 
>> operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
>> ---------------------------------------------------------------
>> int add       2063   2085   660    530    415    283   |      |
>> int mul       2272   2257   1189   733    908    439   |      |
>> int min       2527   2520   2516   2579   2585   2542  | 1    |
>> int max       2548   2525   2551   2516   2515   2517  | 1    |
>> int and       2410   2414   602    480    353    263   |      |
>> int or        2149   2151   597    498    354    262   |      |
>> int xor       2059   2062   605    476    364    263   |      |
>> long add      1776   1790   2000   1000   1683   591   | 2    |
>> long mul      2135   2199   2185   2001   2176   1307  | 2    |
>> long min      1439   1424   1421   1420   1430   1427  | 3    |
>> long max      2299   2287   2303   2305   1433   1425  | 3    |
>> long and      1657   1667   2015   1003   1679   568   | 4    |
>> long or       1776   1783   2032   1009   1680   569   | 4    |
>> long xor      1834   1783   2012   1024   1679   570   | 4    |
>> float add     2779   2644   2633   2648   2632   2639  | 5    |
>> float mul     2779   2871   2810   2776   2732   2791  | 5    |
>> float min     2294   2620   1725   1286   872    672   |      |
>> float max     2371   2519   1697   1265   841    468   |      |
>> double add    2634   2636   2635   2650   2635   2648  | 5    |
>> double mul    3053   2955   2881   3030   2979   2927  | 5    |
>> double min    2364   2400   2439   2399   2486   2398  | 6    |
>> double max    2488   2459   2501 ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   whitespace fix

src/hotspot/share/opto/vectornode.hpp line 438:

> 436:   virtual bool make_normal_vector_op_implemented(const TypeVect* vt) {
> 437:     return Matcher::match_rule_supported_vector(Op_MulVI, vt->length(), vt->element_basic_type());
> 438:   }

I agree with Vladimir's comments, we can remove explicit calls from each reduction node class and introduce factory method _VectorNode::make_from_ropc_ in vectornode.cpp similar to _ReductionNode::make_from_vopc_ which accept reduction opcode and returns equivalent vector node.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13056#discussion_r1195969474