RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2]

Thu Mar 7 15:57:04 UTC 2024

On Thu, 7 Mar 2024 15:33:21 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Subtask of https://github.com/openjdk/jdk/pull/16620.
>> 
>> **Goal**
>> 
>> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`.
>> - Refactoring: replace linked-list edges with a compact array for each node.
>> - No behavioral change to vectorization.
>> 
>> **Benchmark**
>> 
>> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`).
>> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph,
>> ensuring that we spend a lot of time on the dependency graph compared to other components.
>> 
>> Measured on `linux-x64` and turbo disabled.
>> 
>> Measuring Compile time difference:
>> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java`
>> 
>> TestGraph.java
>> 
>> public class TestGraph {
>>     static int RANGE = 100_000;
>> 
>>     public static void main(String[] args) {
>>         int[] a = new int[RANGE];
>>         int[] b = new int[RANGE];
>>         for (int i = 0; i < 10_000; i++) {
>>             test1(a, b, i % 100);
>>         }
>>     }
>> 
>>     static void test1(int[] a, int[] b, int offset) {
>>         for (int i = 0; i < RANGE/16-200; i++) {
>>             a[i * 16 +  0] = b[i * 16 +  0 + offset];
>>             a[i * 16 +  1] = b[i * 16 +  1 + offset];
>>             a[i * 16 +  2] = b[i * 16 +  2 + offset];
>>             a[i * 16 +  3] = b[i * 16 +  3 + offset];
>>             a[i * 16 +  4] = b[i * 16 +  4 + offset];
>>             a[i * 16 +  5] = b[i * 16 +  5 + offset];
>>             a[i * 16 +  6] = b[i * 16 +  6 + offset];
>>             a[i * 16 +  7] = b[i * 16 +  7 + offset];
>>             a[i * 16 +  8] = b[i * 16 +  8 + offset];
>>             a[i * 16 +  9] = b[i * 16 +  9 + offset];
>>             a[i * 16 + 10] = b[i * 16 + 10 + offset];
>>             a[i * 16 + 11] = b[i * 16 + 11 + offset];
>>             a[i * 16 + 12] = b[i * 16 + 12 + offset];
>>             a[i * 16 + 13] = b[i * 16 + 13 + offset];
>>             a[i * 16 + 14] = b[i * 16 + 14 + offset];
>>             a[i * 16 + 15] = b[i * 16 + 15 + offset];
>>         }
>>     }
>> }
>> 
>> 
>> 
>> Before:
>> 
>>     C2 Compile Time:       14.588 s
>> ...
>>          IdealLoop:            13.670 s
>>            AutoVectorize:      11.703 s```
>> 
>> After:
>> 
>>     C2 Compile Time:       14.468 s
>> ...
>>          IdealLoop:            13.595 s
>>      ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   rename extra -> memory

src/hotspot/share/opto/vectorization.cpp line 225:

> 223:   DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena);
> 224:   _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr);
> 225: }

The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that:

if (memory_pred_edges.is_nonempty()) {
  add_node(n1, memory_pred_edges);
}

src/hotspot/share/opto/vectorization.cpp line 285:

> 283:     _memory_pred_edges(nullptr)
> 284: {
> 285:   assert(memory_pred_edges.length() > 0, "not empty");

Suggestion:

  assert(memory_pred_edges.is_nonempty(), "not empty");

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516366524
PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516376416