RFR: 8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651 [v3]

Christian Hagedorn chagedorn at openjdk.org
Thu Apr 4 11:29:10 UTC 2024


On Thu, 4 Apr 2024 05:21:24 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> In [JDK-8325651](https://bugs.openjdk.org/browse/JDK-8325651) / https://github.com/openjdk/jdk/pull/17812 I refactored the dependency graph. It seems I made a typo, and missed a single `!`, which broke `VLoopDependencyGraph::compute_depth` (formerly `SuperWord::compute_max_depth`).
>> 
>> The consequence was that all nodes in the dependency graph had the same depth `1`. A node is supposed to have a higher depth than all its inputs, except for Phi nodes, which have depth 0, as they are at the beginning of the loop's basic block, i.e. they are at the beginning of the DAG.
>> 
>> **Details**
>> 
>> Well, it is a bit more complicated. I had not just forgotten the `!`. Before the change, we used to iterate over the body multiple times, until the depth computation was stable. When I saw this, I assumed this was not necessary, since the `body` is already ordered so that a `def` comes before its `use`s. So I reduced it to a single pass over the `body`.
>> 
>> But this assumption was wrong: I added some assertion code, which detected that something was wrong with the ordering in the `body`. In the failing example, I saw that we had a `Load` and a `Store` with the same memory state. Given the edges, our ordering algorithm for the `body` could schedule `Load` before `Store` or `Store` before `Load`. But that is incorrect: our assumption is that in such cases `Loads` always happen before `Stores`.
>> 
>> Therefore, I had to change the traversal order in `VLoopBody::construct`, so that we visit `Load` before `Store`. With this, I now know that the `body` order is correct for both the data dependency and the memory dependency. Therefore, I only need to iterate over the `body` once in `VLoopDependencyGraph::compute_depth`.
>> 
>> **More Background / Details**
>> 
>> This bug was reported because there were timeouts with `TestAlignVectorFuzzer.java`. This fix seems to improve the compile time drastically for the example below. It seems to be an example with a large dependency graph, where we still attempt to create some packs. This means there is a large number of `independence` checks on the dependency graph. If those are not pruned well, they visit many more nodes than necessary.
>> 
>> Why did I not catch this earlier? I had a compile time benchmark for [JDK-8325651](https://bugs.openjdk.org/browse/JDK-8325651) / https://github.com/openjdk/jdk/pull/17812, but it seems it was not sensitive enough. It has a dense graph, but never actually created any packs. My new benchmark creates ...
>
> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
> 
>  - Merge branch 'master' into JDK-8327978-dependency-graph-comp-time-regression
>  - For Vladimir: check is_Store first
>  - Load / Store precedence
>  - add some comments and asserts
>  - ensure Load/Store order
>  - Merge branch 'master' into JDK-8327978-dependency-graph-comp-time-regression
>  - 8327978
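
The single-pass depth computation described in the quoted text could be sketched roughly as follows. This is a simplified, hypothetical model, not the actual `VLoopDependencyGraph::compute_depth` code: node representation and names are assumptions. It relies on the invariant the PR establishes, namely that `body` is topologically ordered (defs before uses, Loads before Stores), so one forward pass suffices:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical, simplified node: Phi nodes sit at the start of the
// loop's basic block and act as the roots of the DAG.
struct Node {
    bool is_phi;
    std::vector<int> preds;  // indices of input (predecessor) nodes
};

// Single pass over a body that is assumed to be ordered def-before-use.
std::vector<int> compute_depth(const std::vector<Node>& body) {
    std::vector<int> depth(body.size(), 0);
    for (std::size_t i = 0; i < body.size(); i++) {
        if (body[i].is_phi) {
            depth[i] = 0;  // Phis are at the beginning of the DAG
            continue;
        }
        int max_pred = 0;
        for (int p : body[i].preds) {
            max_pred = std::max(max_pred, depth[p]);
        }
        // A node must be strictly deeper than all of its inputs.
        depth[i] = max_pred + 1;
    }
    return depth;
}
```

With a correctly ordered body, every predecessor's depth is final by the time a node is visited, which is exactly why the multi-pass fixpoint iteration is no longer needed.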

Good catch! Could such a long-compiling test also be added as a jtreg test that fails due to a timeout without this patch and passes with it?

src/hotspot/share/opto/vectorization.cpp line 324:

> 322:         }
> 323:       }
> 324:     }

This is identical to the loop above. Could this code be shared (e.g. `find_max_pred_depth()`)?
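
A rough sketch of what such a shared helper could look like. The name `find_max_pred_depth` is taken from the comment above; the signature and types are assumptions for illustration, not the actual HotSpot code:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical shared helper: returns the maximum depth among a node's
// predecessors, so both duplicated loops could call it instead of
// repeating the scan inline.
static int find_max_pred_depth(const std::vector<int>& pred_depths) {
    int max_depth = 0;
    for (int d : pred_depths) {
        max_depth = std::max(max_depth, d);
    }
    return max_depth;
}
```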

-------------

PR Review: https://git.openjdk.org/jdk/pull/18532#pullrequestreview-1979544235
PR Review Comment: https://git.openjdk.org/jdk/pull/18532#discussion_r1551475445


More information about the hotspot-compiler-dev mailing list