RFR: 8341697: C2: Register allocation inefficiency in tight loop [v6]

Mon Oct 14 08:59:18 UTC 2024

On Sun, 13 Oct 2024 07:03:04 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This patch improves spill placement in the presence of loops. Currently, when trying to spill a live range, we create a `Phi` at the loop head. This `Phi` is then spilt inside the loop body, and because the `Phi` is `UP` (lives in a register) at the loop head, we need to emit an additional reload at the loop back-edge block. This introduces loop-carried dependencies, which greatly reduces loop throughput.
>> 
>> My proposal is to be aware of loop heads and try to eagerly spill or reload live ranges at the loop entries. In general, if a live range is spilt in the loop common path, then we should spill it in the loop entries and reload it at its use sites. This may increase the number of loads but will eliminate loop-carried dependencies, making the load latency-free. On the other hand, if a live range is only spilt in the uncommon path but is used in the common path, then we should reload it eagerly. I think it is appropriate to bias towards spilling, i.e. if a live range is both spilt and reloaded in the common path, we spill it. This eliminates loop-carried dependencies.
>> 
>> A downside of this algorithm is that we may overspill: after some live ranges have been spilt, the others no longer need to be spilt but are spilt anyway.
>> 
>> - A possible approach is to split the live ranges one-by-one and try to colour them afterwards. This seems prohibitively expensive.
>> - Another approach is to be aware of the number of registers that need spilling, sorting the live ones accordingly.
>> - Finally, we can eagerly split a live range at uncommon branches and do conservative coalescing afterwards. I think this is the most elegant and efficient solution for that.
>> 
>> Please take a look and leave your reviews, thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   refine comments + typo

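Fix confirmed. Performance now matches what a user would expect when pulling data into locals: `manualCopyAntiUnroll3` drops from about 1646 ns/op to about 564 ns/op. I will look into the runtime difference for the plain loops (`manualCopy1`, `manualCopy2`) and `systemCopy`, which are slower on the new build.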

### Old - JDK 23.0.0

Benchmark                                   (SIZE)  Mode  Cnt     Score     Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10    70.222 ±   3.549  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10    70.011 ±   0.880  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10   394.275 ±  20.067  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10   636.158 ± 101.505  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  1646.330 ±  23.042  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10    74.845 ±   1.535  ns/op

### New - JDK 24-internal (merrykitty/improveregalloc, 12d1a2b21fc62145dac04fecf43f267f539b2aa5)

Benchmark                                   (SIZE)  Mode  Cnt    Score    Error  Units
Example8ArrayCopying.manualCopy1              1000  avgt   10   80.155 ±  4.504  ns/op
Example8ArrayCopying.manualCopy2              1000  avgt   10   81.122 ±  3.074  ns/op
Example8ArrayCopying.manualCopyAntiUnroll1    1000  avgt   10  394.094 ±  6.809  ns/op
Example8ArrayCopying.manualCopyAntiUnroll2    1000  avgt   10  626.155 ± 13.055  ns/op
Example8ArrayCopying.manualCopyAntiUnroll3    1000  avgt   10  564.199 ± 23.854  ns/op
Example8ArrayCopying.systemCopy               1000  avgt   10   99.393 ±  0.634  ns/op

Source code for reference: https://github.com/Xceptance/jmh-training/blob/1dbcc9c38553b0e8b683c6f70475a25150b66635/src/main/java/org/xc/jmh/Example8ArrayCopying.java
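
For readers without the linked source at hand, here is a minimal sketch of the two basic shapes being compared: an element-by-element copy loop and `System.arraycopy`. The class name `ArrayCopySketch` and both method bodies are assumptions made for illustration only. They are not taken from `Example8ArrayCopying`, and the anti-unroll variants are deliberately not reproduced here.

```java
import java.util.Arrays;

public class ArrayCopySketch {
    // Hypothetical stand-in for a "manual copy" benchmark body:
    // a plain counted loop that copies one element per iteration.
    static int[] manualCopy(int[] src) {
        int[] dest = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[i] = src[i];
        }
        return dest;
    }

    // Hypothetical stand-in for a "system copy" benchmark body:
    // delegates the whole copy to System.arraycopy.
    static int[] systemCopy(int[] src) {
        int[] dest = new int[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }

    public static void main(String[] args) {
        // Same element count as the SIZE parameter in the tables above.
        int[] src = new int[1000];
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        // Both variants must produce identical contents.
        System.out.println(Arrays.equals(manualCopy(src), systemCopy(src)));
    }
}
```

This only checks that the two variants produce the same result. It says nothing about timing, which needs a JMH harness like the one in the linked repository.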

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21472#issuecomment-2410501449