RFR: 8325651: C2 SuperWord: refactor the dependency graph

Emanuel Peter epeter at openjdk.org
Wed Feb 28 13:22:14 UTC 2024


Subtask of https://github.com/openjdk/jdk/pull/16620.

**Goal**

- Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`.
- Refactoring: replace linked-list edges with a compact array for each node.
- No behavioral change to vectorization.

**Benchmark**

I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`).
All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph,
ensuring that we spend a lot of time on the dependency graph compared to other components.

Measured on `linux-x64` and turbo disabled.

Measuring Compile time difference:
`/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java`

TestGraph.java

public class TestGraph {
    static int RANGE = 100_000;

    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        for (int i = 0; i < 10_000; i++) {
            test1(a, b, i % 100);
        }
    }

    static void test1(int[] a, int[] b, int offset) {
        for (int i = 0; i < RANGE/16-200; i++) {
            a[i * 16 +  0] = b[i * 16 +  0 + offset];
            a[i * 16 +  1] = b[i * 16 +  1 + offset];
            a[i * 16 +  2] = b[i * 16 +  2 + offset];
            a[i * 16 +  3] = b[i * 16 +  3 + offset];
            a[i * 16 +  4] = b[i * 16 +  4 + offset];
            a[i * 16 +  5] = b[i * 16 +  5 + offset];
            a[i * 16 +  6] = b[i * 16 +  6 + offset];
            a[i * 16 +  7] = b[i * 16 +  7 + offset];
            a[i * 16 +  8] = b[i * 16 +  8 + offset];
            a[i * 16 +  9] = b[i * 16 +  9 + offset];
            a[i * 16 + 10] = b[i * 16 + 10 + offset];
            a[i * 16 + 11] = b[i * 16 + 11 + offset];
            a[i * 16 + 12] = b[i * 16 + 12 + offset];
            a[i * 16 + 13] = b[i * 16 + 13 + offset];
            a[i * 16 + 14] = b[i * 16 + 14 + offset];
            a[i * 16 + 15] = b[i * 16 + 15 + offset];
        }
    }
}



Before:

    C2 Compile Time:       14.588 s
...
         IdealLoop:            13.670 s
           AutoVectorize:      11.703 s```

After:

    C2 Compile Time:       14.468 s
...
         IdealLoop:            13.595 s
           AutoVectorize:      11.539 s


Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print`
Before: `8_225_680 - 8_258_408` bytes.
After: `7_384_448 - 7_351_720` bytes.

Conclusion:
The difference in AutoVectorization time with/without the patch is marginal, possibly slightly improving with the patch.
We see that a very large percentage of compile time is spent on AutoVectorization, however this is not representative of normal compilations with a much lower `LoopUnrollLimit`. Memory usage seems to be significantly improved with the patch, about `10%` over the whole compilation. Of course this is inflated with this special benchmark.

-------------

Commit messages:
 - typo
 - fix depth of Phi node
 - more cosmetics
 - improve comments
 - make depth private
 - replace old graph with new graph
 - cosmetic stuff
 - compute depth
 - DefIterator
 - cosmetic fix
 - ... and 6 more: https://git.openjdk.org/jdk/compare/490825fb...08b8df3f

Changes: https://git.openjdk.org/jdk/pull/17812/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8325651
  Stats: 709 lines in 5 files changed: 281 ins; 404 del; 24 mod
  Patch: https://git.openjdk.org/jdk/pull/17812.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812

PR: https://git.openjdk.org/jdk/pull/17812


More information about the hotspot-compiler-dev mailing list