RFR: 8325651: C2 SuperWord: refactor the dependency graph
Emanuel Peter
epeter at openjdk.org
Wed Feb 28 13:22:15 UTC 2024
On Mon, 12 Feb 2024 16:24:30 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> Subtask of https://github.com/openjdk/jdk/pull/16620.
>
> **Goal**
>
> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`.
> - Refactoring: replace linked-list edges with a compact array for each node.
> - No behavioral change to vectorization.
>
> **Benchmark**
>
> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`).
> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph,
> ensuring that we spend a lot of time on the dependency graph compared to other components.
>
> Measured on `linux-x64` and turbo disabled.
>
> Measuring Compile time difference:
> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java`
>
> TestGraph.java
>
> public class TestGraph {
> static int RANGE = 100_000;
>
> public static void main(String[] args) {
> int[] a = new int[RANGE];
> int[] b = new int[RANGE];
> for (int i = 0; i < 10_000; i++) {
> test1(a, b, i % 100);
> }
> }
>
> static void test1(int[] a, int[] b, int offset) {
> for (int i = 0; i < RANGE/16-200; i++) {
> a[i * 16 + 0] = b[i * 16 + 0 + offset];
> a[i * 16 + 1] = b[i * 16 + 1 + offset];
> a[i * 16 + 2] = b[i * 16 + 2 + offset];
> a[i * 16 + 3] = b[i * 16 + 3 + offset];
> a[i * 16 + 4] = b[i * 16 + 4 + offset];
> a[i * 16 + 5] = b[i * 16 + 5 + offset];
> a[i * 16 + 6] = b[i * 16 + 6 + offset];
> a[i * 16 + 7] = b[i * 16 + 7 + offset];
> a[i * 16 + 8] = b[i * 16 + 8 + offset];
> a[i * 16 + 9] = b[i * 16 + 9 + offset];
> a[i * 16 + 10] = b[i * 16 + 10 + offset];
> a[i * 16 + 11] = b[i * 16 + 11 + offset];
> a[i * 16 + 12] = b[i * 16 + 12 + offset];
> a[i * 16 + 13] = b[i * 16 + 13 + offset];
> a[i * 16 + 14] = b[i * 16 + 14 + offset];
> a[i * 16 + 15] = b[i * 16 + 15 + offset];
> }
> }
> }
>
>
>
> Before:
>
> C2 Compile Time: 14.588 s
> ...
> IdealLoop: 13.670 s
> AutoVectorize: 11.703 s```
>
> After:
>
> C2 Compile Time: 14.468 s
> ...
> IdealLoop: 13.595 s
> AutoVectorize: 11.539 s
>
>
> Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print`
> Before: `8_225_680 - 8_258_408` byt...
src/hotspot/share/opto/superword.cpp line 459:
> 457:
> 458: // compute function depth(Node*)
> 459: compute_max_depth();
Note: moved to `VLoopAnalyzer::setup_submodules_helper`
src/hotspot/share/opto/superword.cpp line 756:
> 754: // Add dependence edges to load/store nodes for memory dependence
> 755: // A.out()->DependNode.in(1) and DependNode.out()->B.prec(x)
> 756: void SuperWord::dependence_graph() {
Note: new code is in `VLoopDependencyGraph::construct`. The general idea is the same, but I refactored it a bit.
src/hotspot/share/opto/superword.cpp line 788:
> 786: // Create a sink for the slice
> 787: DepMem* slice_sink = _dg.make_node(nullptr);
> 788: _dg.make_edge(slice_sink, _dg.tail());
Note: as far as I can tell, there is no reason to create these `sink` nodes. They are later never referenced again.
src/hotspot/share/opto/superword.cpp line 3064:
> 3062: }
> 3063: ct++;
> 3064: } while (again);
Note: moved to `VLoopDependencyGraph::compute_depth`. Simplified it, we only need a single pass.
src/hotspot/share/opto/superword.cpp line 3815:
> 3813: _dep_next = nullptr;
> 3814: }
> 3815: next();
Note: this was to set up the iteration over the `n->in(i)` (limited range, based on node type), and iteration over the `dep` edges (stored in a linked list).
I don't think the `is_Mem` case ever occurs (added an assert in my new code).
New code in: `VLoopDependencyGraph::PredsIterator::PredsIterator`
src/hotspot/share/opto/superword.cpp line 3850:
> 3848: _dep_next = nullptr;
> 3849: }
> 3850: next();
Note: `DepSuccs` was never used
src/hotspot/share/opto/superword.hpp line 73:
> 71: DepMem* _succ;
> 72: DepEdge* _next_in; // list of in edges, null terminated
> 73: DepEdge* _next_out; // list of out edges, null terminated
Note: the `out / succ` edges were never used.
src/hotspot/share/opto/superword.hpp line 293:
> 291: return TraceSuperWord ||
> 292: _vloop.vtrace().is_trace(TraceAutoVectorizationTag::SW_DEPENDENCE_GRAPH);
> 293: }
Noe: replaced with `DEPENDENCY_GRAPH`
src/hotspot/share/opto/superword.hpp line 367:
> 365: // Max expression (DAG) depth from beginning of the block for each node
> 366: int depth(Node* n) const { return _node_info.adr_at(bb_idx(n))->_depth; }
> 367: void set_depth(Node* n, int d) { int i = bb_idx(n); grow_node_info(i); _node_info.adr_at(i)->_depth = d; }
Note: bonus: `depth` and `set_depth` is not private in `VLoopDependencyGraph`.
src/hotspot/share/opto/vectorization.hpp line 510:
> 508: MemNode* _node; // Corresponding ideal node
> 509: const uint _extra_pred_edges_length;
> 510: int* _extra_pred_edges; // extra def-edges, mapping to bb_idx
Note: I'm now using an array instead of a linked list for the nodes.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505819042
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505731065
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505732210
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505733586
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505736605
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505737280
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505797283
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505798923
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505820489
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1505828381
More information about the hotspot-compiler-dev
mailing list