RFR: 8287087: C2: perform SLP reduction analysis on-demand
Roberto Castañeda Lozano
rcastanedalo at openjdk.org
Wed Mar 22 11:11:46 UTC 2023
Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)).
This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops:
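As a minimal illustration (hypothetical code, not taken from the changeset or its tests), consider how unrolling changes the shape the analysis has to recognize: after 2x unrolling, a single reduction cycle contains two reduction operators of the same type instead of one.

```java
public class UnrolledReductionIllustration {
    // Before unrolling: one add per iteration on the reduction cycle.
    static int reduce(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Conceptual shape after 2x unrolling: the phi -> add -> add -> phi cycle
    // now contains two reduction operators that must both be recognized.
    static int reduceUnrolled2x(int[] a) {
        int sum = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            sum += a[i];
            sum += a[i + 1];
        }
        for (; i < a.length; i++) { // tail iterations for odd lengths
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        System.out.println(reduce(a) + " " + reduceUnrolled2x(a)); // 15 15
    }
}
```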

The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but its overhead remains negligible in both micro- and general benchmarks.
## Performance Benefits
As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios.
### Increased Auto-Vectorization Scope
There are two main scenarios in which the proposed changeset enables further auto-vectorization:
#### Reductions Using Global Accumulators
```java
public class Foo {
    int acc = 0;
    (..)
    void reduce(int[] array) {
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
    }
}
```
Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:
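As a hypothetical sketch of that effective shape (illustrative source-level code, not the changeset's IR), once the field load and store are moved out of the loop the accumulator behaves like a local variable, leaving a plain add-reduction cycle inside the loop:

```java
public class GlobalAccumulatorSketch {
    int acc = 0;

    // Illustrative source-level view of the shape after optimization:
    // the field is read once before the loop and written once after it.
    void reduceOptimizedShape(int[] array) {
        int tmp = acc;                       // load hoisted out of the loop
        for (int i = 0; i < array.length; i++) {
            tmp += array[i];                 // recognizable reduction cycle
        }
        acc = tmp;                           // store sunk out of the loop
    }

    public static void main(String[] args) {
        GlobalAccumulatorSketch s = new GlobalAccumulatorSketch();
        s.reduceOptimizedShape(new int[]{1, 2, 3});
        System.out.println(s.acc);           // 6
    }
}
```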

#### Reductions of Partially Unrolled Loops
```java
(..)
for (int i = 0; i < array.length / 2; i++) {
    acc += array[2*i];
    acc += array[2*i + 1];
}
(..)
```
These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically.
### Increased Performance of x64 Floating-Point `Math.min()/max()`
Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
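As an illustrative shape (not copied from `FpMinMaxIntrinsics.java`), a floating-point min reduction in a loop whose exit condition depends on the data, and which is therefore not a counted loop, can now also use the specialized scalar implementation:

```java
public class NonCountedMinIllustration {
    // Illustrative only: the data-dependent exit keeps this from being a
    // counted loop, yet the Math.min() accumulation is still a reduction.
    static float minUntilNegative(float[] a) {
        float min = Float.POSITIVE_INFINITY;
        int i = 0;
        while (i < a.length && a[i] >= 0.0f) {
            min = Math.min(min, a[i]);
            i++;
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(minUntilNegative(new float[]{3.0f, 1.5f, -2.0f, 0.5f})); // 1.5
    }
}
```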
## Implementation details
The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
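The following is a minimal, self-contained sketch of that chain search over a toy node graph (class and field names are hypothetical and do not mirror C2's data structures):

```java
import java.util.Objects;

// Toy data-flow node: an opcode plus two data inputs.
final class Node {
    final String opcode;
    final Node[] in;
    Node(String opcode, Node in0, Node in1) {
        this.opcode = opcode;
        this.in = new Node[]{in0, in1};
    }
}

public class ReductionChainSketch {
    // Starting from the loop phi's back-edge input, walks 'length' nodes of
    // the given opcode, following the same input index at every step (the
    // "same edge index" assumption). The chain is a reduction if it closes
    // back on the phi. Only two paths (index 0 and index 1) are tried.
    static boolean isReductionChain(Node phi, String opcode, int length) {
        for (int inputIdx = 0; inputIdx < 2; inputIdx++) {
            Node current = phi.in[1];          // value flowing along the back-edge
            boolean brokeChain = false;
            for (int i = 0; i < length; i++) {
                if (current == null || !Objects.equals(current.opcode, opcode)) {
                    brokeChain = true;
                    break;
                }
                current = current.in[inputIdx];
            }
            if (!brokeChain && current == phi) {
                return true;                   // cycle phi -> op -> ... -> op -> phi found
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Build phi -> add2 -> add1 -> phi, i.e. a 2x-unrolled integer add reduction.
        Node phi  = new Node("Phi", null, null);
        Node add1 = new Node("AddI", phi,  new Node("LoadI", null, null));
        Node add2 = new Node("AddI", add1, new Node("LoadI", null, null));
        phi.in[1] = add2;                      // back-edge input of the phi
        System.out.println(isReductionChain(phi, "AddI", 2)); // true
    }
}
```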
A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE.
The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
## Testing
### Functionality
- tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64).
- fuzzing (12 h. on linux-x64 and linux-aarch64).
##### TestGeneralizedReductions.java
Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bit platforms, since I do not have access to 32-bit ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bit restriction.
##### TestFpMinMaxReductions.java
Tests the matching of floating-point max/min implementations in x64.
##### TestSuperwordFailsUnrolling.java
This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test.
### Performance
#### General Benchmarks
The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64.
#### Micro-benchmarks
The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)).
##### VectorReduction.java
These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases.
##### MaxIntrinsics.java
This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
| micro-benchmark | speedup compared to mainline |
| --- | --- |
| `fMinReduceInOuterLoop` | 1.1x |
| `fMinReduceNonCounted` | 2.3x |
| `fMinReduceGlobalAccumulator` | 2.4x |
| `fMinReducePartiallyUnrolled` | 3.9x |
## Acknowledgments
Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.
-------------
Commit messages:
- Do not run test in x86-32
- Update existing test instead of removing it
- Add negative vectorization test
- Update copyright headers
- Add two more reduction vectorization microbenchmarks
- Ensure that the reduction vectorization microbenchmarks actually vectorize
- Add new FP min/max micro-benchmarks
- Add one more FP min/max test case
- Refine check conditions
- Run min/max reduction tests only for UseAVX > 0
- ... and 13 more: https://git.openjdk.org/jdk/compare/7277bb19...72fe5a6a
Changes: https://git.openjdk.org/jdk/pull/13120/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13120&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8287087
Stats: 801 lines in 17 files changed: 639 ins; 99 del; 63 mod
Patch: https://git.openjdk.org/jdk/pull/13120.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120
PR: https://git.openjdk.org/jdk/pull/13120