RFR: 8333258: C2: high memory usage in PhaseCFG::insert_anti_dependences() [v2]

Fri Jul 5 07:56:28 UTC 2024

On Tue, 25 Jun 2024 13:15:47 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> In a debug build, `PhaseCFG::insert_anti_dependences()` is called
>> twice for a single node: once for actual processing, once for
>> verification.
>> 
>> In TestAntiDependenciesHighMemUsage, the test has a `Region` that
>> merges 337 incoming path. It also has one `Phi` per memory slice that
>> are stored to: 1000 `Phi` nodes. Each `Phi` node has 337 inputs that
>> are identical except for one. The common input is the memory state on
>> method entry. The test has 60 `Load` that needs to be processed for
>> anti dependences. All `Load` share the same memory input: the memory
>> state on method entry. For each `Load`, all `Phi` nodes are pushed 336
>> times on the work lists for anti dependence processing because all of
>> them appear multiple times as uses of each `Load`s memory state: `Phi`s
>> are pushed 336 000 on 2 work lists. Memory is not reclaimed on exit
>> from `PhaseCFG::insert_anti_dependences()` so memory usage grows as
>> `Load` nodes are processed:
>> 
>> 336000 * 2 work lists * 60 loads * 8 bytes pointer = 322 MB. 
>> 
>> The fix I propose for this is to not push `Phi` nodes more than once
>> when they have the same inputs multiple times.
>> 
>> In TestAntiDependenciesHighMemUsage2, the test has 4000 loads. For
>> each of them, when processed for anti dependences, all 4000 loads are
>> pushed on the work lists because they share the same memory
>> input. Then when they are popped from the work list, they are
>> discarded because only stores are of interest:
>> 
>> 4000 loads processed * 4000 loads pushed * 2 work lists * 8 bytes pointer = 256 MB. 
>> 
>> The fix I propose for this is to test before pushing on the work list
>> whether a node is a store or not.
>> 
>> Finally, I propose adding a `ResourceMark` so memory doesn't
>> accumulate over calls to `PhaseCFG::insert_anti_dependences()`.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - review
>  - Merge branch 'master' into JDK-8333258
>  - whitespaces
>  - tests & fix

@rwestrel thanks for the updates! Sorry it took me a while to come back to this.

I think most likely your fix is correct. But I have a few complaints / suggestions, some of which I have already mentioned but you did not yet respond to - maybe you disagree with them. Sorry, I'll be a little annoying now 😬 

- The naming of `store` in this code is really bad. Rather than `mem` and `store`, it should be `use` and `def`. Because apparently a `store` can also be a `Phi` - that is not very intuitive.
- This bad naming was there before your fix, but you are leaning into it with code like this, and this just makes the code even more confusing than it already is:

        } else if (store->is_Phi()) {
          // A Phi could have the same mem as input multiple times. If that's the case, we don't need to enqueue it
          // more than once.
          if (already_enqueued(worklist_mem, worklist_store, mem, store->as_Phi())) {
            continue;
          }

And:

assert(!store->needs_anti_dependence_check(), "only stores");

But it turns out that also a `MergeMem` has no anti-dependence-check, as the code suggests. So your assert text should be more accurate.

- Existing comments are half-accurate it seems:

  // Therefore, the branches visited by the worklist are of this form:
  //    initial_mem -> (MergeMem ->)* store

There are not just `MergeMem` on those branches, but also `Phi`s, right? While we are working on this code, we might as well fix that.

- For clarity, I would rename the inputs of `already_enqueued`:

static bool already_enqueued(const Node_List& worklist_mem, const Node_List& worklist_store, Node* mem, PhiNode* phi) {
->
static bool already_enqueued(const Node_List& worklist_mem, const Node_List& worklist_store, Node* def_mem, PhiNode* use_phi) {

- Another refactoring suggestion: Rather than having lengthy comments about saying that `worklist_mem/worklist_store` is a pair of lists, I would make it into a list of pairs. This could be called `edge_worklist` or `use_def_worklist`. Maybe there are even memory benefits because of better co-location.

- You removed the last:

if (store->needs_anti_dependence_check())  continue;  // not really a store

That is fine, but I would replace it with an assert, this helps the reader to know that after the traversal loop we actually have some node with that characteristic.

- The argument you are making in `already_enqueued` (that all of `mem`'s uses are processed in one go) is quite implicit. How much runtime would we actually lose if we scanned the whole worklist instead? I'm just wondering if this is not a premature optimization, that then requires an argument that is not clear to the reader (it was not immediately clear to me and also not to Vladimir K).

-------------

PR Review: https://git.openjdk.org/jdk/pull/19791#pullrequestreview-2160164410