RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible [v7]

Tue May 16 17:51:57 UTC 2023

On Mon, 15 May 2023 11:05:06 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171
>> 
>> I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All of the reductions that can be re-ordered are to extend from this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow for reordering of operations.
>> 
>> The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.
>> 
>> **Performance results**
>> I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.
>> 
>> I disabled `turbo-boost`.
>> Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
>> Full `avx512` support, including `avx512dq` required for `MulReductionVL`.
>> 
>> 
>> operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
>> ---------------------------------------------------------------
>> int add       2063   2085   660    530    415    283   |      |
>> int mul       2272   2257   1189   733    908    439   |      |
>> int min       2527   2520   2516   2579   2585   2542  | 1    |
>> int max       2548   2525   2551   2516   2515   2517  | 1    |
>> int and       2410   2414   602    480    353    263   |      |
>> int or        2149   2151   597    498    354    262   |      |
>> int xor       2059   2062   605    476    364    263   |      |
>> long add      1776   1790   2000   1000   1683   591   | 2    |
>> long mul      2135   2199   2185   2001   2176   1307  | 2    |
>> long min      1439   1424   1421   1420   1430   1427  | 3    |
>> long max      2299   2287   2303   2305   1433   1425  | 3    |
>> long and      1657   1667   2015   1003   1679   568   | 4    |
>> long or       1776   1783   2032   1009   1680   569   | 4    |
>> long xor      1834   1783   2012   1024   1679   570   | 4    |
>> float add     2779   2644   2633   2648   2632   2639  | 5    |
>> float mul     2779   2871   2810   2776   2732   2791  | 5    |
>> float min     2294   2620   1725   1286   872    672   |      |
>> float max     2371   2519   1697   1265   841    468   |      |
>> double add    2634   2636   2635   2650   2635   2648  | 5    |
>> double mul    3053   2955   2881   3030   2979   2927  | 5    |
>> double min    2364   2400   2439   2399   2486   2398  | 6    |
>> double max    2488   2459   2501 ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   whitespace fix

src/hotspot/share/opto/vectornode.hpp line 244:

> 242: 
> 243:   virtual VectorNode* make_normal_vector_op(Node* in1, Node* in2, const TypeVect* vt) = 0;
> 244:   virtual bool make_normal_vector_op_implemented(const TypeVect* vt) = 0;

How about introducing `virtual int vect_Opcode()` (or `norm_vect_Opcode()`), or something similar, which returns the normal vector opcode (for example, `Op_AddVI` for `AddReductionVINode`)? Then these two functions do not need to be virtual:

  virtual int vect_Opcode() const = 0;
  VectorNode* make_normal_vector_op(Node* in1, Node* in2, const TypeVect* vt) {
    return VectorNode::make(vect_Opcode(), in1, in2, vt);
  }
  bool make_normal_vector_op_implemented(const TypeVect* vt) {
    return Matcher::match_rule_supported_vector(vect_Opcode(), vt->length(), vt->element_basic_type());
  }
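
To show the shape of this suggestion in isolation, here is a minimal standalone sketch. It does not use the real C2 classes: every name ending in `Sketch` is a hypothetical stand-in, and the support rule in `match_rule_supported_sketch` is invented purely for illustration. The only point is that a single virtual opcode hook lets both helpers live, non-virtual, in the base class.

```cpp
#include <cassert>

// Hypothetical stand-ins for vector opcodes; not the real C2 Op_* values.
enum OpcodeSketch { Op_AddVI_Sketch, Op_MulVL_Sketch };

// Mock of a platform support query. The rule is invented for this sketch:
// the add opcode is always supported, the mul opcode only up to 4 lanes.
inline bool match_rule_supported_sketch(OpcodeSketch opc, int vlen) {
  return opc == Op_AddVI_Sketch || vlen <= 4;
}

// Mock of a vector node: just records what would have been created.
struct VectorOpSketch {
  OpcodeSketch opcode;
  int in1;
  int in2;
  int length;
};

class UnorderedReductionSketch {
 public:
  virtual ~UnorderedReductionSketch() = default;

  // The only per-subclass hook: the element-wise vector opcode
  // that corresponds to this reduction.
  virtual OpcodeSketch vect_opcode() const = 0;

  // Non-virtual helpers shared by all subclasses.
  bool make_normal_vector_op_implemented(int vlen) const {
    return match_rule_supported_sketch(vect_opcode(), vlen);
  }
  VectorOpSketch make_normal_vector_op(int in1, int in2, int vlen) const {
    assert(make_normal_vector_op_implemented(vlen));
    return VectorOpSketch{vect_opcode(), in1, in2, vlen};
  }
};

class AddReductionVISketch : public UnorderedReductionSketch {
 public:
  OpcodeSketch vect_opcode() const override { return Op_AddVI_Sketch; }
};

class MulReductionVLSketch : public UnorderedReductionSketch {
 public:
  OpcodeSketch vect_opcode() const override { return Op_MulVL_Sketch; }
};
```

Each subclass then only states its opcode, and adding a new reduction type would not require touching the two helpers.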

If we need this in more cases than your changes cover, we could have an even more general `scalar_Opcode()` (in the `VectorNode` class) and use `VectorNode::opcode(scalar_Opcode(), vt->element_basic_type())` to get the normal vector opcode. This may need more changes and testing, so it would be a separate RFE.
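
A rough standalone sketch of that more general variant, again with hypothetical names only: the `SK_*` enumerators and `vector_opcode_sketch` are mock stand-ins for the scalar opcode, the element type, and the scalar-to-vector mapping, and the mapping table is deliberately simplified compared to what C2 would need.

```cpp
// Hypothetical stand-ins; none of these are real C2 opcodes or BasicType values.
enum ScalarOpSketch { SK_Add, SK_Mul };
enum ElemSketch { SK_Int, SK_Long, SK_Float };
enum VectorOpcSketch { SK_NoVectorOp, SK_AddVI, SK_AddVL, SK_MulVI, SK_MulVL };

// Mock mapping from (scalar operation, element type) to a vector opcode.
// Combinations this sketch does not model yield SK_NoVectorOp.
inline VectorOpcSketch vector_opcode_sketch(ScalarOpSketch sopc, ElemSketch bt) {
  switch (sopc) {
    case SK_Add:
      if (bt == SK_Int)  return SK_AddVI;
      if (bt == SK_Long) return SK_AddVL;
      break;
    case SK_Mul:
      if (bt == SK_Int)  return SK_MulVI;
      if (bt == SK_Long) return SK_MulVL;
      break;
  }
  return SK_NoVectorOp;
}

class ReductionBaseSketch {
 public:
  virtual ~ReductionBaseSketch() = default;

  // Subclasses only name the scalar operation they reduce with.
  virtual ScalarOpSketch scalar_opcode() const = 0;

  // Non-virtual: the vector opcode is derived from the scalar opcode
  // and the element type, so no per-type override is needed.
  VectorOpcSketch normal_vector_opcode(ElemSketch bt) const {
    return vector_opcode_sketch(scalar_opcode(), bt);
  }
};

class AddReductionSketch : public ReductionBaseSketch {
 public:
  ScalarOpSketch scalar_opcode() const override { return SK_Add; }
};

class MulReductionSketch : public ReductionBaseSketch {
 public:
  ScalarOpSketch scalar_opcode() const override { return SK_Mul; }
};
```

With this shape, one class per operation could serve several element types, which is why it would touch more code than the single `vect_Opcode()` hook.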

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13056#discussion_r1195504471