RFR: 8287087: C2: perform SLP reduction analysis on-demand

Roberto Castañeda Lozano rcastanedalo at openjdk.org
Thu Mar 23 10:12:43 UTC 2023


On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:

> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)).
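> 
> For illustration (this example is not taken from the changeset), the simplest form of such a reduction cycle corresponds to a plain scalar accumulation at the source level:
> 
>     (..)
>     int sum = 0;
>     for (int i = 0; i < array.length; i++) {
>       sum += array[i];   // the loop-carried value of 'sum' forms a reduction cycle (phi -> AddI -> phi)
>     }
>     (..)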
> 
> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops:
> 
> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png)
> 
> The generalized analysis removes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but the difference is negligible in both micro- and general benchmarks.
> 
> ## Performance Benefits
> 
> As a side benefit, the proposed generalization finds more reductions, widening the scope of auto-vectorization and improving the performance of x64 floating-point `Math.min()/max()` in multiple scenarios.
> 
> ### Increased Auto-Vectorization Scope
> 
> There are two main scenarios in which the proposed changeset enables further auto-vectorization:
> 
> #### Reductions Using Global Accumulators
> 
> 
> public class Foo {
>   int acc = 0;
>   (..)
>   void reduce(int[] array) {
>     for (int i = 0; i < array.length; i++) {
>       acc += array[i];
>     }
>   }
> }
> 
> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:
> 
> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png)
> 
> #### Reductions of Partially Unrolled Loops
> 
> 
>     (..)
>     for (int i = 0; i < array.length / 2; i++) {
>       acc += array[2*i];
>       acc += array[2*i + 1];
>     }
>     (..)
> 
> 
> Such loops are already unrolled manually in the source code, so the current reduction analysis fails to find their reductions, whereas the proposed analysis detects them just as if the loops had been unrolled automatically.
> 
> ### Increased Performance of x64 Floating-Point `Math.min()/max()`
> 
> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
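> 
> For illustration, a non-counted float min reduction of the kind targeted by the new micro-benchmarks might look roughly as follows (hypothetical sketch; the actual benchmark kernels are in `FpMinMaxIntrinsics.java`):
> 
>     (..)
>     float min = Float.POSITIVE_INFINITY;
>     int i = 0;
>     while (array[i] != 0.0f) {   // non-counted loop: the exit condition is data-dependent
>       min = Math.min(min, array[i]);
>       i++;
>     }
>     (..)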
> 
> ## Implementation Details
> 
> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missed vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
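> 
> For intuition, here is a simplified, hypothetical sketch (written in Java over a toy node class purely for illustration) of the per-phi search described above; the actual implementation is part of C2's C++ sources, operates on real IR nodes, and additionally handles details such as the edge swapping discussed below:
> 
>     // Toy stand-ins for IR nodes: 'opcode' and 'in' loosely mirror the C2 concepts.
>     final class Node {
>       final String opcode;   // e.g. "Phi", "AddI", "MinF"
>       final Node[] in;       // inputs; in[0] is ignored here, as for C2 data nodes
>       Node(String opcode, Node... in) { this.opcode = opcode; this.in = in; }
>     }
> 
>     final class ReductionChainSketch {
>       static final int MAX_CHAIN_LENGTH = 64; // arbitrary bound, just for the sketch
> 
>       // A reduction cycle exists if, starting at the phi's backedge input and
>       // repeatedly following the same input index through same-opcode operators,
>       // we arrive back at the phi. Only two paths are tested per phi.
>       static boolean hasReductionCycle(Node phi, int backedgeIndex) {
>         Node first = phi.in[backedgeIndex];
>         return followsChain(phi, first, 1) || followsChain(phi, first, 2);
>       }
> 
>       private static boolean followsChain(Node phi, Node first, int inputIndex) {
>         Node current = first;
>         for (int steps = 0; steps < MAX_CHAIN_LENGTH; steps++) {
>           if (!current.opcode.equals(first.opcode)) {
>             return false;            // chain broken by a different operator
>           }
>           Node next = current.in[inputIndex];
>           if (next == phi) {
>             return true;             // the chain starts and finishes on the phi
>           }
>           current = next;
>         }
>         return false;                // give up on overly long chains
>       }
> 
>       public static void main(String[] args) {
>         // Model a 2x-unrolled int-sum reduction: phi <- AddI <- AddI <- phi.
>         Node phi  = new Node("Phi", null, null, null);
>         Node add1 = new Node("AddI", null, phi, new Node("LoadI"));
>         Node add2 = new Node("AddI", null, add1, new Node("LoadI"));
>         phi.in[2] = add2;            // set the phi's backedge input
>         System.out.println(hasReductionCycle(phi, 2)); // prints: true
>       }
>     }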
> 
> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain-finding approach with a more general shortest-path-finding algorithm. This alternative might remove the need to track edge swapping, at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE.
> 
> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
> 
> ## Testing
> 
> ### Functionality
> 
> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64).
> - fuzzing (12 h. on linux-x64 and linux-aarch64).
> 
> ##### TestGeneralizedReductions.java
> 
> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bit platforms, since I do not have access to 32-bit ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bit restriction.
> 
> ##### TestFpMinMaxReductions.java
> 
> Tests the matching of floating-point max/min implementations in x64.
> 
> ##### TestSuperwordFailsUnrolling.java
> 
> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test.
> 
> ### Performance
> 
> #### General Benchmarks
> 
> The changeset does not cause any performance regression on the DaCapo, SPECjvm2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64.
> 
> #### Micro-benchmarks
> 
> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)).
> 
> 
> ##### VectorReduction.java
> 
> These micro-benchmarks are first adjusted to actually vectorize under the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where it improves performance by 2.4x in both cases.
> 
> ##### MaxIntrinsics.java
> 
> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks and moderately to substantially improves the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
> 
> | micro-benchmark             | speedup compared to mainline |
> | ---                         | ---                          |
> | `fMinReduceInOuterLoop`       | 1.1x                         |
> | `fMinReduceNonCounted`        | 2.3x                         |
> | `fMinReduceGlobalAccumulator` | 2.4x                         |
> | `fMinReducePartiallyUnrolled` | 3.9x                         |
> 
> ## Acknowledgments
> 
> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.

Thanks for your comments and suggestions, @eme64! I have addressed (or discussed) all of them and will do some testing before updating the changeset.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1480918361

