Integrated: 8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651

Fri Apr 5 06:51:16 UTC 2024

On Thu, 28 Mar 2024 15:13:05 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> In [JDK-8325651](https://bugs.openjdk.org/browse/JDK-8325651) / https://github.com/openjdk/jdk/pull/17812 I refactored the dependency graph. It seems I made a typo, and missed a single `!`, which broke `VLoopDependencyGraph::compute_depth` (formerly `SuperWord::compute_max_depth`).
> 
> The consequence was that all nodes in the dependency graph had the same depth `1`. A node is supposed to have a higher depth than all its inputs, except for Phi nodes, which have depth 0, as they are at the beginning of the loop's basic block, i.e. they are at the beginning of the DAG.
> 
> **Details**
> 
> Well, it is a bit more complicated. I had not just forgotten about the `!`. Before the change, we used to iterate over the body multiple times, until the depth computation was stable. When I saw this, I assumed this was not necessary, since the `body` is already ordered such that `def` comes before `use`. So I reduced it to a single pass over the `body`.
> 
> But this assumption was wrong: I added some assertion code, which detected that something was wrong with the ordering in the `body`. In the failing example, I saw that we had a `Load` and a `Store` with the same memory state. Given the edges, our ordering algorithm for the `body` could schedule `Load` before `Store` or `Store` before `Load`. But that is incorrect: our assumption is that in such cases `Loads` always happen before `Stores`.
> 
> Therefore, I had to change the traversal order in `VLoopBody::construct`, so that we visit `Load` before `Store`. With this, I now know that the `body` order is correct for both the data dependency and the memory dependency. Therefore, I only need to iterate over the `body` once in `VLoopDependencyGraph::compute_depth`.
> 
> **More Background / Details**
> 
> This bug was reported because there were timeouts with `TestAlignVectorFuzzer.java`. This fix seems to improve the compile time drastically for the example below. It seems to be an example with a large dependency graph, where we still attempt to create some packs. This means there is a large number of `independence` checks on the dependency graph. If those are not pruned well, then they visit many more nodes than necessary.
> 
> Why did I not catch this earlier? I had a compile time benchmark for [JDK-8325651](https://bugs.openjdk.org/browse/JDK-8325651) / https://github.com/openjdk/jdk/pull/17812, but it seems it was not sensitive enough. It has a dense graph, but never actually created any packs. My new benchmark creates packs, which unlocks more checks d...
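
For readers unfamiliar with the depth-based pruning described in the quoted text, here is a minimal Python sketch of the idea. It is not the HotSpot code. The `Node` tuple, the `compute_depth` and `independent` functions, and the toy `body` are hypothetical names chosen for illustration. The sketch assumes that `body` is ordered so that every dependency-graph predecessor appears before its use, that Phi nodes get depth 0, and that every other node gets a depth one higher than its deepest predecessor. The independence check walks up from the deeper node and skips predecessors that are shallower than the other node, since those cannot lead back to it.

```python
from collections import namedtuple

# Hypothetical stand-in for a loop-body node (not a HotSpot type).
# preds: names of the node's predecessors in the dependency graph.
Node = namedtuple("Node", ["name", "kind", "preds"])


def compute_depth(body):
    """Single pass over body; assumes predecessors come before their uses."""
    depth = {}
    for n in body:
        if n.kind == "Phi":
            depth[n.name] = 0  # Phi nodes sit at the start of the DAG
            continue
        max_pred = 0
        for p in n.preds:
            # If this fails, the body order is wrong and one pass is not enough.
            assert p in depth, f"{p} must be ordered before {n.name}"
            max_pred = max(max_pred, depth[p])
        depth[n.name] = max_pred + 1
    return depth


def independent(a, b, nodes, depth):
    """Return (is_independent, number_of_nodes_visited)."""
    if depth[a] == depth[b]:
        # Equal depth: neither node can be a predecessor of the other.
        return a != b, 0
    deep, shallow = (a, b) if depth[a] > depth[b] else (b, a)
    min_depth = depth[shallow]
    worklist, seen, visited = [deep], {deep}, 0
    while worklist:
        n = worklist.pop()
        visited += 1
        for p in nodes[n].preds:
            if depth[p] < min_depth:
                continue  # prune: p is too shallow to reach `shallow`
            if p == shallow:
                return False, visited
            if p not in seen:
                seen.add(p)
                worklist.append(p)
    return True, visited


# Toy body: two independent chains hanging off one Phi.
body = [
    Node("phi", "Phi", ()),
    Node("a1", "Load", ("phi",)),
    Node("a2", "Add", ("a1",)),
    Node("a3", "Add", ("a2",)),
    Node("b1", "Load", ("phi",)),
    Node("b2", "Add", ("b1",)),
]
nodes = {n.name: n for n in body}
depth = compute_depth(body)

print(depth)
print(independent("a3", "b2", nodes, depth))
print(independent("a3", "a1", nodes, depth))
```

In this sketch the traversal for `a3` versus `b2` never walks below depth 2, so it does not reach `a1` or `phi`. If the depths carried no useful information, the same walk would have no basis for stopping early, which is the kind of extra traversal work the quoted description attributes to poorly pruned checks.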

This pull request has now been integrated.

Changeset: 9da5170a
Author:    Emanuel Peter <epeter at openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/9da5170a0eb9f141022f86d749af3b5780b75cb7
Stats:     181 lines in 4 files changed: 171 ins; 0 del; 10 mod

8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651

Reviewed-by: chagedorn, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/18532