RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible [v5]

Thu May 11 07:58:49 UTC 2023

On Wed, 10 May 2023 11:45:38 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> https://github.com/openjdk/jdk/blob/cc9e7e8e773e773af87615fdae037a8f8ea82635/src/hotspot/share/opto/loopopts.cpp#L4125-L4171
>> 
>> I introduced a new abstract node type `UnorderedReductionNode` (subtype of `ReductionNode`). All of the reductions that can be re-ordered are to extend from this node type: `int/long add/mul/and/or/xor/min/max`, as well as `float/double min/max`. `float/double add/mul` do not allow for reordering of operations.
>> 
>> The optimization is part of loop-opts, and called after `SuperWord` in `PhaseIdealLoop::build_and_optimize`.
>> 
>> **Performance results**
>> I ran `test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java`, with `2_000` warmup and `100_000` perf iterations. I also increased the array length to `RANGE = 16*1024`.
>> 
>> I disabled `turbo-boost`.
>> Machine: `11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16`.
>> Full `avx512` support, including `avx512dq` required for `MulReductionVL`.
>> 
>> 
>> operation     M-N-2  M-N-3  M-2    M-3    P-2    P-3   | note |
>> ---------------------------------------------------------------
>> int add       2063   2085   660    530    415    283   |      |
>> int mul       2272   2257   1189   733    908    439   |      |
>> int min       2527   2520   2516   2579   2585   2542  | 1    |
>> int max       2548   2525   2551   2516   2515   2517  | 1    |
>> int and       2410   2414   602    480    353    263   |      |
>> int or        2149   2151   597    498    354    262   |      |
>> int xor       2059   2062   605    476    364    263   |      |
>> long add      1776   1790   2000   1000   1683   591   | 2    |
>> long mul      2135   2199   2185   2001   2176   1307  | 2    |
>> long min      1439   1424   1421   1420   1430   1427  | 3    |
>> long max      2299   2287   2303   2305   1433   1425  | 3    |
>> long and      1657   1667   2015   1003   1679   568   | 4    |
>> long or       1776   1783   2032   1009   1680   569   | 4    |
>> long xor      1834   1783   2012   1024   1679   570   | 4    |
>> float add     2779   2644   2633   2648   2632   2639  | 5    |
>> float mul     2779   2871   2810   2776   2732   2791  | 5    |
>> float min     2294   2620   1725   1286   872    672   |      |
>> float max     2371   2519   1697   1265   841    468   |      |
>> double add    2634   2636   2635   2650   2635   2648  | 5    |
>> double mul    3053   2955   2881   3030   2979   2927  | 5    |
>> double min    2364   2400   2439   2399   2486   2398  | 6    |
>> double max    2488   2459   2501 ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   use is_counted and is_innermost

> I agree with the phasing of the optimization as it gives us an opportunity to perform similar optimization for Vectors created at parse time i.e., though VectorAPIs.
> 
> ![image](https://user-images.githubusercontent.com/59989778/237607853-e786cc8b-fca8-49f4-814a-788c429ed473.png)
> 
> VectorAPI based kernel will have a different graph shape and proposed pattern matching will not be able to handle it. Also, trip count is feeding into LoadVector and scalar reduction operation (scalar add in above example) is secondary an induction variable. I think we can still handle it in a follow up patch by doing a two pass over loop.
> 
> * Scan loop body and collect all the UnorderdReduction and their users.
> * Exist optimization if any of following condition holds good.
>   
>   * Different UnorderedReduction have different reduction opcodes.
>   * Reduction node has more than one user.
> * If above conditions are met, then your algorithm will have to traverse the Scalar operations chain and check if one of its input is UnorderedReduction and other input should be driven by same graph pattern.
> * Once we find a legal graph pallet then replace Reductions with Vector counterparts and move reduction out of loop as is being done currently by your patch.
> 
> ![image](https://user-images.githubusercontent.com/59989778/237607878-ed54aabe-6438-4256-b774-9010c04999ae.png)

BTW, with VectorAPI users are expected to be more intelligent and your optimizations can be directly implemented in kernel which perform VectorADD operations in main loop followed by Reduction out of loop e.g.

  outer_loop : 
        hand_unrolled_vector_loop:
               v1 = VectorADD(broadcast(0))
               v2 = v1.VectorADD(LoadVector)
               v3 = v2.VectorADD(LoadVector)
               ...
               ...
      inner_loop_end
      res  += v3.ReductionAdd()
  outer_loop_end

So its not a pressing issue anyways for us.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13056#issuecomment-1543507479