RFR: 8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651
Emanuel Peter
epeter at openjdk.org
Wed Apr 3 13:56:17 UTC 2024
In [JDK-8325651](https://bugs.openjdk.org/browse/JDK-8325651) / https://github.com/openjdk/jdk/pull/17812 I refactored the dependency graph. It seems I made a typo, and missed a single `!`, which broke `VLoopDependencyGraph::compute_depth` (formerly `SuperWord::compute_max_depth`).
The consequence was that all nodes in the dependency graph had the same depth `1`. A node is supposed to have a higher depth than all its inputs, except for Phi nodes, which have depth 0, as they are at the beginning of the loop's basic block, i.e. they are at the beginning of the DAG.
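The depth rule can be sketched as follows (a hypothetical Java model for illustration; `DepthSketch` and `depthOf` are made-up names, not the actual C2 C++ code):

```java
import java.util.List;

// Hypothetical model of the depth invariant in VLoopDependencyGraph::compute_depth.
class DepthSketch {
    // A Phi sits at the start of the loop's basic block, so its depth is 0.
    // Every other node must be strictly deeper than all of its inputs.
    static int depthOf(boolean isPhi, List<Integer> inputDepths) {
        if (isPhi) {
            return 0;
        }
        int max = 0;
        for (int d : inputDepths) {
            max = Math.max(max, d);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        int phi   = depthOf(true,  List.of());          // 0
        int load  = depthOf(false, List.of(phi));       // 1
        int store = depthOf(false, List.of(phi, load)); // 2
        System.out.println(phi + " " + load + " " + store); // prints "0 1 2"
    }
}
```

With the bug, `depthOf` effectively returned `1` for every non-Phi node, so the invariant "deeper than all inputs" no longer held.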
**Details**
Well, it is a bit more complicated. I had not just forgotten the `!`. Before the change, we used to iterate over the `body` multiple times, until the depth computation was stable. When I saw this, I assumed the repeated iteration was not necessary, since the `body` is already ordered such that `def` comes before `use`. So I reduced it to a single pass over the `body`.
But this assumption was wrong: I added some assertion code, which detected that something was wrong with the ordering in the `body`. In the failing example, I saw that we had a `Load` and a `Store` with the same memory state. Given only the edges, our ordering algorithm for the `body` could schedule the `Load` before the `Store`, or the `Store` before the `Load`. The latter is incorrect: our assumption is that in such cases the `Load` always happens before the `Store`.
Therefore, I had to change the traversal order in `VLoopBody::construct`, so that we visit a `Load` before a `Store` with the same memory state. With this, the `body` order is now correct for both the data dependencies and the memory dependencies, and a single pass over the `body` in `VLoopDependencyGraph::compute_depth` suffices.
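The single-pass idea can be sketched like this (a hypothetical Java model; the index-based body representation and the name `computeDepths` are made up for illustration):

```java
// Hypothetical sketch: once the body is ordered def-before-use (with Loads
// before Stores on the same memory state), one forward pass assigns stable
// depths -- no fixpoint iteration needed.
class SinglePassDepth {
    // inputs[i] lists the body indices of node i's dependency inputs;
    // the body order guarantees every j in inputs[i] satisfies j < i.
    static int[] computeDepths(int[][] inputs, boolean[] isPhi) {
        int n = inputs.length;
        int[] depth = new int[n];
        for (int i = 0; i < n; i++) {
            if (isPhi[i]) {
                depth[i] = 0;
                continue;
            }
            int max = 0;
            for (int j : inputs[i]) {
                max = Math.max(max, depth[j]); // depth[j] is already final
            }
            depth[i] = max + 1;
        }
        return depth;
    }

    public static void main(String[] args) {
        // Phi -> Load -> Store chain: depths 0, 1, 2.
        int[][] inputs = { {}, {0}, {0, 1} };
        boolean[] isPhi = { true, false, false };
        // prints "[0, 1, 2]"
        System.out.println(java.util.Arrays.toString(computeDepths(inputs, isPhi)));
    }
}
```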
**More Background / Details**
This bug was reported because there were timeouts with `TestAlignVectorFuzzer.java`. This fix seems to improve the compile time drastically for the example below. It seems to be an example with a large dependency graph, where we still attempt to create some packs. This means a large number of `independence` checks run on the dependency graph. If those checks are not pruned well, they visit many more nodes than necessary.
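To illustrate why the broken depths are so costly, here is a hypothetical sketch of a depth-pruned independence query (made-up names and representation, not the actual C2 code): the search from the deeper node never descends below the shallower node's depth, so with correct depths the search is cut off early, while with all depths stuck at `1` the bound never prunes anything.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a depth-pruned independence query on the
// dependency graph. Any path from 'deep' to 'shallow' via inputs only
// passes through nodes at depth >= depth[shallow], so shallower nodes
// can be pruned from the search.
class IndependenceSketch {
    static boolean independent(int a, int b, int[][] inputs, int[] depth) {
        int deep = depth[a] >= depth[b] ? a : b;
        int shallow = deep == a ? b : a;
        Deque<Integer> work = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        work.push(deep);
        while (!work.isEmpty()) {
            int n = work.pop();
            if (n == shallow) {
                return false; // found a dependency path between the two nodes
            }
            for (int in : inputs[n]) {
                // Prune: nodes below the shallower node's depth cannot reach it.
                if (depth[in] >= depth[shallow] && visited.add(in)) {
                    work.push(in);
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Phi(0) -> Load(1) -> Store(2); node 3 is another Load of the Phi.
        int[][] inputs = { {}, {0}, {0, 1}, {0} };
        int[] depth = { 0, 1, 2, 1 };
        System.out.println(independent(2, 1, inputs, depth)); // false: Store depends on Load
        System.out.println(independent(3, 1, inputs, depth)); // true: unrelated Loads
    }
}
```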
Why did I not catch this earlier? I had a compile-time benchmark for [JDK-8325651](https://bugs.openjdk.org/browse/JDK-8325651) / https://github.com/openjdk/jdk/pull/17812, but it seems it was not sensitive enough: it had a dense graph, but never actually created any packs. My new benchmark creates packs, which unlocks more checks during `filter_packs_for_mutual_independence`; these seem to stress the dependency graph traversals much more.
If such large dense dependency graphs turn out to be very common, we could take more drastic steps in the future:
- Bail out of SuperWord if the graph gets too large.
- Implement a data structure that is better for dense graphs, such as a matrix, where we mark the cell for `(n1, n2)` corresponding to the `independence(n1, n2)` query. This would make independence checks a constant time lookup, rather than a graph traversal.
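The matrix idea from the last bullet could be sketched like this (a hypothetical design, not existing C2 code; `DependenceMatrix` is a made-up name):

```java
import java.util.BitSet;

// Hypothetical sketch: precompute dependence into an n x n bit matrix so that
// independence(n1, n2) becomes a constant-time lookup instead of a graph
// traversal. The trade-off is O(n^2) memory, which only pays off for graphs
// dense enough that traversals dominate compile time.
class DependenceMatrix {
    private final BitSet dependent;
    private final int n;

    DependenceMatrix(int n) {
        this.n = n;
        this.dependent = new BitSet(n * n);
    }

    void markDependent(int n1, int n2) {
        dependent.set(n1 * n + n2);
        dependent.set(n2 * n + n1); // the query is symmetric
    }

    boolean independent(int n1, int n2) {
        return !dependent.get(n1 * n + n2); // O(1) lookup
    }

    public static void main(String[] args) {
        DependenceMatrix m = new DependenceMatrix(4);
        m.markDependent(1, 2);
        System.out.println(m.independent(1, 2)); // false
        System.out.println(m.independent(0, 3)); // true
    }
}
```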
--------------------
I extracted a simple compile-time benchmark from `TestAlignVectorFuzzer.java`:
`/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=printcompilation,TestGraph2::* -XX:CompileCommand=compileonly,TestGraph2::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=0 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph2.java`
With patch:

    C2 Compile Time:    8.234 s
      IdealLoop:        8.170 s
        AutoVectorize:  7.789 s

master:

    C2 Compile Time:   56.223 s
      IdealLoop:       56.017 s
        AutoVectorize: 55.576 s
```java
import java.util.Random;

class TestGraph2 {
    private static final Random random = new Random();
    static final int RANGE_CON = 1024 * 8;

    static int init = 593436;
    static int limit = 599554;
    static int offset1 = -592394;
    static int offset2 = -592386;
    static final int offset3 = -592394;
    static final int stride = 4;
    static final int scale = 1;
    static final int hand_unrolling1 = 2;
    static final int hand_unrolling2 = 8;
    static final int hand_unrolling3 = 15;

    public static void main(String[] args) {
        byte[] aB = generateB();
        byte[] bB = generateB();
        byte[] cB = generateB();
        for (int i = 1; i < 100; i++) {
            testUUBBBH(aB, bB, cB);
        }
    }

    static byte[] generateB() {
        byte[] a = new byte[RANGE_CON];
        for (int i = 0; i < a.length; i++) {
            a[i] = (byte)random.nextInt();
        }
        return a;
    }

    static Object[] testUUBBBH(byte[] a, byte[] b, byte[] c) {
        int h1 = hand_unrolling1;
        int h2 = hand_unrolling2;
        int h3 = hand_unrolling3;

        for (int i = init; i < limit; i += stride) {
            if (h1 >= 1)  { a[offset1 + i * scale + 0]++; }
            if (h1 >= 2)  { a[offset1 + i * scale + 1]++; }
            if (h1 >= 3)  { a[offset1 + i * scale + 2]++; }
            if (h1 >= 4)  { a[offset1 + i * scale + 3]++; }
            if (h1 >= 5)  { a[offset1 + i * scale + 4]++; }
            if (h1 >= 6)  { a[offset1 + i * scale + 5]++; }
            if (h1 >= 7)  { a[offset1 + i * scale + 6]++; }
            if (h1 >= 8)  { a[offset1 + i * scale + 7]++; }
            if (h1 >= 9)  { a[offset1 + i * scale + 8]++; }
            if (h1 >= 10) { a[offset1 + i * scale + 9]++; }
            if (h1 >= 11) { a[offset1 + i * scale + 10]++; }
            if (h1 >= 12) { a[offset1 + i * scale + 11]++; }
            if (h1 >= 13) { a[offset1 + i * scale + 12]++; }
            if (h1 >= 14) { a[offset1 + i * scale + 13]++; }
            if (h1 >= 15) { a[offset1 + i * scale + 14]++; }
            if (h1 >= 16) { a[offset1 + i * scale + 15]++; }

            if (h2 >= 1)  { b[offset2 + i * scale + 0]++; }
            if (h2 >= 2)  { b[offset2 + i * scale + 1]++; }
            if (h2 >= 3)  { b[offset2 + i * scale + 2]++; }
            if (h2 >= 4)  { b[offset2 + i * scale + 3]++; }
            if (h2 >= 5)  { b[offset2 + i * scale + 4]++; }
            if (h2 >= 6)  { b[offset2 + i * scale + 5]++; }
            if (h2 >= 7)  { b[offset2 + i * scale + 6]++; }
            if (h2 >= 8)  { b[offset2 + i * scale + 7]++; }
            if (h2 >= 9)  { b[offset2 + i * scale + 8]++; }
            if (h2 >= 10) { b[offset2 + i * scale + 9]++; }
            if (h2 >= 11) { b[offset2 + i * scale + 10]++; }
            if (h2 >= 12) { b[offset2 + i * scale + 11]++; }
            if (h2 >= 13) { b[offset2 + i * scale + 12]++; }
            if (h2 >= 14) { b[offset2 + i * scale + 13]++; }
            if (h2 >= 15) { b[offset2 + i * scale + 14]++; }
            if (h2 >= 16) { b[offset2 + i * scale + 15]++; }

            if (h3 >= 1)  { c[offset3 + i * scale + 0]++; }
            if (h3 >= 2)  { c[offset3 + i * scale + 1]++; }
            if (h3 >= 3)  { c[offset3 + i * scale + 2]++; }
            if (h3 >= 4)  { c[offset3 + i * scale + 3]++; }
            if (h3 >= 5)  { c[offset3 + i * scale + 4]++; }
            if (h3 >= 6)  { c[offset3 + i * scale + 5]++; }
            if (h3 >= 7)  { c[offset3 + i * scale + 6]++; }
            if (h3 >= 8)  { c[offset3 + i * scale + 7]++; }
            if (h3 >= 9)  { c[offset3 + i * scale + 8]++; }
            if (h3 >= 10) { c[offset3 + i * scale + 9]++; }
            if (h3 >= 11) { c[offset3 + i * scale + 10]++; }
            if (h3 >= 12) { c[offset3 + i * scale + 11]++; }
            if (h3 >= 13) { c[offset3 + i * scale + 12]++; }
            if (h3 >= 14) { c[offset3 + i * scale + 13]++; }
            if (h3 >= 15) { c[offset3 + i * scale + 14]++; }
            if (h3 >= 16) { c[offset3 + i * scale + 15]++; }
        }
        return new Object[]{ a, b, c };
    }
}
```
-------------
Commit messages:
- Load / Store precedence
- add some comments and asserts
- ensure Load/Store order
- Merge branch 'master' into JDK-8327978-dependency-graph-comp-time-regression
- 8327978
Changes: https://git.openjdk.org/jdk/pull/18532/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18532&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8327978
Stats: 46 lines in 2 files changed: 44 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/18532.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18532/head:pull/18532
PR: https://git.openjdk.org/jdk/pull/18532
More information about the hotspot-compiler-dev mailing list