RFR: 8341697: C2: Register allocation inefficiency in tight loop [v7]

Thu May 22 15:04:01 UTC 2025

On Mon, 14 Oct 2024 14:17:09 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This patch improves the spill placement in the presence of loops. Currently, when trying to spill a live range, we will create a `Phi` at the loop head, this `Phi` will then be spilt inside the loop body, and as the `Phi` is `UP` (lives in register) at the loop head, we need to emit an additional reload at the loop back-edge block. This introduces loop-carried dependencies, greatly reduces loop throughput.
>> 
>> My proposal is to be aware of loop heads and try to eagerly spill or reload live ranges at the loop entries. In general, if a live range is spilt in the loop common path, then we should spill it in the loop entries and reload it at its use sites, this may increase the number of loads but will eliminate loop-carried dependencies, making the load latency-free. On the otherhand, if a live range is only spilt in the uncommon path but is used in the common path, then we should reload it eagerly. I think it is appropriate to bias towards spilling, i.e. if a live range is both spilt and reloaded in the common path, we spill it. This eliminates loop-carried dependencies.
>> 
>> A downfall of this algorithm is that we may overspill, which means that after spilling some live ranges, the others do not need to be spilt anymore but are unnecessarily spilt.
>> 
>> - A possible approach is to split the live ranges one-by-one and try to colour them afterwards. This seems prohibitively expensive.
>> - Another approach is to be aware of the number of registers that need spilling, sorting the live ones accordingly.
>> - Finally, we can eagerly split a live range at uncommon branches and do conservative coalescing afterwards. I think this is the most elegant and efficient solution for that.
>> 
>> Please take a look and leave your reviews, thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix uncommon_freq

Thanks for working on this @merykitty! Here is my long overdue review. I've studied both the existing logic in `PhaseChaitin::Split` (took quite some time) as well as your changes.

First, let me explain my understanding of the problem. Let me know if I've misunderstood something. In pass 1 during splitting, we get to a loop Phi with (at least) one unknown input, and determine from the other known inputs that the Phi is UP. Later on, it turns out one of the unknown inputs actually went DOWN, and additionally in a high frequency block. The Phi prematurely set to UP now forces us to put an expensive reload in the high frequency block.

> A downfall of this algorithm is that we may overspill, which means that after spilling some live ranges, the others do not need to be spilt anymore but are unnecessarily spilt.

I can reproduce the improvements for `LoopCounterBench.java`, so that looks good. Unfortunately, benchmarks (DaCapo, Renaissance, SPECjvm, SPECjbb) indicate more regressions than improvements. I've merged your changes with `master` and will rerun the benchmarks to double check.

Regarding the changeset, I think the code looks good overall. Can you elaborate more regarding the motivation behind the decision table and 10% threshold in `should_spill_before_loop`? What experiments did you run to confirm these are good choices? Perhaps the table and/or the threshold can be tuned to avoid the regressions found during the benchmarking above?

While I do think your suggested changes look reasonable, it seems to me that the fundamental problem is really that we only have partial reaching definitions information available when inserting Phis? I'm guessing it is too expensive to compute complete reaching defitions in the beginning of `PhaseChaitin::Split`, and to then update the reaching definitions continuously during splitting (because we add new definitions while splitting)? Would it perhaps be possible to first perform all the non-Phi splits, and then add required Phis afterwards? Just thinking out loud, there is likely some important interaction that I'm not aware of.

src/hotspot/share/opto/gcm.cpp line 2305:

> 2303:   }
> 2304: }
> 2305: 

Can you explain this removal?

src/hotspot/share/opto/reg_split.cpp line 493:

> 491: 
> 492: // Decide the action at the loop entry based on whether the live range is used and whether it needs spilling
> 493: // Common means that the frequency the action performed is more than 10% the frequency of the loop head

Suggestion:

// Common means that the frequency of the action is more than 10% the frequency of the loop head

src/hotspot/share/opto/reg_split.cpp line 494:

> 492: // Decide the action at the loop entry based on whether the live range is used and whether it needs spilling
> 493: // Common means that the frequency the action performed is more than 10% the frequency of the loop head
> 494: // Untaken means that the frequency the action performed is not more than the frequency of the loop entry (outside the loop)

Suggestion:

// Untaken means that the frequency of the action is not more than the frequency of the loop entry (outside the loop)

src/hotspot/share/opto/reg_split.cpp line 507:

> 505: //  Untaken   |  Reload   |   Reload   |    None
> 506: //
> 507: // In general, if a live range is spilt more than it is used, we try to eagerly spill it, and vice versa,

spilt (UK) -> spilled (US) to match the current spelling in the source code. Also in other places.

src/hotspot/share/opto/reg_split.cpp line 513:

> 511: static SpillAction should_spill_before_loop(const PhaseCFG& cfg, PhaseChaitin& chaitin, CFGLoop* loop, uint lrg_idx, const LRG& lrg) {
> 512:   constexpr double uncommon_threshold = 0.1;
> 513:   assert(&chaitin.lrgs(lrg_idx) == &lrg, "must be");

Please add a more descriptive assert message than the generic "must be".

src/hotspot/share/opto/reg_split.cpp line 522:

> 520:     Block* b = cfg.get_block(bidx);
> 521:     if (!loop->in_loop_nest(b)) {
> 522:       continue;

Is there not a more efficient way to iterate through all the loops in the loop nest?

src/hotspot/share/opto/reg_split.cpp line 525:

> 523:     }
> 524: 
> 525:     // Implementation details: high pressure only records the start idx, not the end idx

Suggestion:

    // Implementation detail: we only record the start idx of the high pressure within the block; there is no end idx

This is my understanding of how the high pressure tracking works, but please correct me if I'm wrong.

src/hotspot/share/opto/reg_split.cpp line 527:

> 525:     // Implementation details: high pressure only records the start idx, not the end idx
> 526:     if (is_high_pressure(b, &lrg, b->end_idx())) {
> 527:       // If a node needs spilling in a child loop, we can spill it at the child entry, too. Choose the best option.

I'm a bit confused with "Choose the best option". Am I correct to intepret this as "do not spill at the loop entry if there is a better option further down the loop tree that we will reach later on in pass 1"? Would be great to elaborate a bit in the comment (also further down, for the detection of uses).

-------------

Changes requested by dlunden (Committer).

PR Review: https://git.openjdk.org/jdk/pull/21472#pullrequestreview-2861541732
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102752992
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102757048
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102757834
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102762088
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102763947
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102766052
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102770416
PR Review Comment: https://git.openjdk.org/jdk/pull/21472#discussion_r2102771540