RFR: 8341697: C2: Register allocation inefficiency in tight loop [v6]

Quan Anh Mai qamai at openjdk.org
Mon Oct 14 13:45:23 UTC 2024


On Sun, 13 Oct 2024 07:03:04 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This patch improves spill placement in the presence of loops. Currently, when trying to spill a live range, we create a `Phi` at the loop head; this `Phi` is then spilt inside the loop body, and since the `Phi` is `UP` (lives in a register) at the loop head, we need to emit an additional reload at the loop back-edge block. This introduces a loop-carried dependency, greatly reducing loop throughput.
>> 
>> My proposal is to be aware of loop heads and try to eagerly spill or reload live ranges at the loop entries. In general, if a live range is spilt on the loop's common path, then we should spill it at the loop entries and reload it at its use sites; this may increase the number of loads but eliminates the loop-carried dependency, making the loads latency-free. On the other hand, if a live range is only spilt on an uncommon path but is used on the common path, then we should reload it eagerly. I think it is appropriate to bias towards spilling, i.e. if a live range is both spilt and reloaded on the common path, we spill it. This eliminates loop-carried dependencies.
>> 
>> A downside of this algorithm is that we may overspill: after some live ranges are spilt, others may no longer need to be spilt but are spilt unnecessarily anyway.
>> 
>> - A possible approach is to split the live ranges one-by-one and try to colour them afterwards. This seems prohibitively expensive.
>> - Another approach is to track the number of registers that need to be freed and sort the live ranges accordingly.
>> - Finally, we can eagerly split a live range at uncommon branches and do conservative coalescing afterwards. I think this is the most elegant and efficient solution.
>> 
>> Please take a look and leave your reviews, thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   refine comments + typo

Thanks for the source code. That's really interesting: running the benchmark multiple times may give different results, and even when the observed throughputs differ, the two compiled methods are exactly the same. So I think we are running into different quirks here, probably because this benchmark saturates the memory bandwidth.
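For readers following along, here is a hypothetical Java kernel (not the PR's benchmark; the class and method names are made up for illustration) that creates the kind of register pressure the quoted description is about: more values live across the back-edge than there are general-purpose registers, forcing the allocator to spill some of them.

```java
public class SpillDemo {
    // Hypothetical tight loop with more live values than available registers.
    // With the old spill placement, a spilled accumulator is reloaded on the
    // back-edge, so each iteration's reload waits on the previous iteration's
    // store: a loop-carried dependency. The proposed scheme instead spills
    // once at the loop entry and reloads at each use inside the body, so the
    // loads are off the critical path.
    static long manyLiveValues(long[] a) {
        long s0 = 0, s1 = 1, s2 = 2, s3 = 3, s4 = 4, s5 = 5, s6 = 6;
        long s7 = 7, s8 = 8, s9 = 9, s10 = 10, s11 = 11, s12 = 12, s13 = 13;
        for (long x : a) {
            // All fourteen accumulators are live across the back-edge; on
            // x86-64 there are not enough general-purpose registers for them
            // all, so the allocator must spill some to the stack.
            s0 += x; s1 += x; s2 += x; s3 += x; s4 += x; s5 += x; s6 += x;
            s7 += x; s8 += x; s9 += x; s10 += x; s11 += x; s12 += x; s13 += x;
        }
        return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7
             + s8 + s9 + s10 + s11 + s12 + s13;
    }

    public static void main(String[] args) {
        System.out.println(manyLiveValues(new long[]{1, 2, 3}));
    }
}
```

Whether any given accumulator is actually spilt depends on the target and the allocator's choices; the point is only that such a loop is where back-edge reloads versus entry spills become visible in throughput.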

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21472#issuecomment-2411312474


More information about the hotspot-compiler-dev mailing list