RFR: 8341697: C2: Register allocation inefficiency in tight loop [v6]
Rene Schwietzke
duke at openjdk.org
Mon Oct 14 08:59:18 UTC 2024
On Sun, 13 Oct 2024 07:03:04 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:
>> Hi,
>>
>> This patch improves the spill placement in the presence of loops. Currently, when trying to spill a live range, we will create a `Phi` at the loop head, this `Phi` will then be spilt inside the loop body, and as the `Phi` is `UP` (lives in register) at the loop head, we need to emit an additional reload at the loop back-edge block. This introduces loop-carried dependencies, greatly reduces loop throughput.
>>
>> My proposal is to be aware of loop heads and try to eagerly spill or reload live ranges at the loop entries. In general, if a live range is spilt in the loop common path, then we should spill it in the loop entries and reload it at its use sites, this may increase the number of loads but will eliminate loop-carried dependencies, making the load latency-free. On the otherhand, if a live range is only spilt in the uncommon path but is used in the common path, then we should reload it eagerly. I think it is appropriate to bias towards spilling, i.e. if a live range is both spilt and reloaded in the common path, we spill it. This eliminates loop-carried dependencies.
>>
>> A downfall of this algorithm is that we may overspill, which means that after spilling some live ranges, the others do not need to be spilt anymore but are unnecessarily spilt.
>>
>> - A possible approach is to split the live ranges one-by-one and try to colour them afterwards. This seems prohibitively expensive.
>> - Another approach is to be aware of the number of registers that need spilling, sorting the live ones accordingly.
>> - Finally, we can eagerly split a live range at uncommon branches and do conservative coalescing afterwards. I think this is the most elegant and efficient solution for that.
>>
>> Please take a look and leave your reviews, thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
> refine comments + typo
Fix confirmed. Performance matches the user expectation when pulling data local. I will look into the runtime difference for the plain loop and systemcopy.
### Old - JDK 23.0.0
Benchmark (SIZE) Mode Cnt Score Error Units
Example8ArrayCopying.manualCopy1 1000 avgt 10 70.222 ± 3.549 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 70.011 ± 0.880 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 394.275 ± 20.067 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 636.158 ± 101.505 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 1646.330 ± 23.042 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 74.845 ± 1.535 ns/op
### New - JDK 24-internal (merrykitty/improveregalloc, 12d1a2b21fc62145dac04fecf43f267f539b2aa5)
Example8ArrayCopying.manualCopy1 1000 avgt 10 80.155 ± 4.504 ns/op
Example8ArrayCopying.manualCopy2 1000 avgt 10 81.122 ± 3.074 ns/op
Example8ArrayCopying.manualCopyAntiUnroll1 1000 avgt 10 394.094 ± 6.809 ns/op
Example8ArrayCopying.manualCopyAntiUnroll2 1000 avgt 10 626.155 ± 13.055 ns/op
Example8ArrayCopying.manualCopyAntiUnroll3 1000 avgt 10 564.199 ± 23.854 ns/op
Example8ArrayCopying.systemCopy 1000 avgt 10 99.393 ± 0.634 ns/op
Source code for reference: https://github.com/Xceptance/jmh-training/blob/1dbcc9c38553b0e8b683c6f70475a25150b66635/src/main/java/org/xc/jmh/Example8ArrayCopying.java
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21472#issuecomment-2410501449
More information about the hotspot-compiler-dev
mailing list