RFR: 8344130: C2: Avoid excessive hoisting in scheduler due to minuscule differences in block frequency

Thu Dec 19 13:59:40 UTC 2024

On Thu, 19 Dec 2024 12:48:19 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:

>> `PhaseCFG::is_cheaper_block` can sometimes excessively hoist instructions through blocks due to minuscule differences in block frequency, even when the differences are likely caused by numerical imprecision in the block frequency computations. We saw an example of where such excessive hoisting stressed the register allocator in [JDK-8331295](https://bugs.openjdk.org/browse/JDK-8331295), but that issue was in fact two issues: one in the matcher (solved in [JDK-8331295](https://bugs.openjdk.org/browse/JDK-8331295)) and one in the scheduler (this issue).
>> 
>> ### Changeset
>> 
>> Add a small delta to the frequency comparison in `PhaseCFG::is_cheaper_block`. Note that a frequency comparison using the delta is already available in the function when making sure a hoist due to latency does not result in a higher (worse) frequency. I cannot see any reason for why we should not use the same delta in the first block frequency comparison.
>> 
>> I do not include a regression test since I have not found a good one specific to this issue. I have verified that this fix is an alternative solution to solve the failure in [JDK-8331295](https://bugs.openjdk.org/browse/JDK-8331295) (for which tests are already present). I also documented the verification steps in the issue description in JBS.
>> 
>> ### Testing
>> 
>> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/12181425502)
>> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
>> - Performance testing using DaCapo, SPECjbb, and SPECjvm on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. No significant improvements nor regressions.
>
> Looks good, thanks for running more tests!
> 
> Here is some anecdotal evidence of the benefit of this change (from the first compilation in `java -Xcomp -XX:CompileOnly=java.util.ArrayList::toArray` ):
> 
> ![comparison](https://github.com/user-attachments/assets/e5af7035-c9d1-43b0-a468-ff7536575f1e)
> 
> In mainline (before), `270 leaPCompressedOopOffset` is unnecessarily hoisted from B52 to B50 (whose frequency should be equal but is slightly smaller due to numerical inaccuracies), which forces the register allocator to insert a register-to-register copy from R9 to RDI. With this change, `270 leaPCompressedOopOffset` stays in its original block and can be directly assigned the appropriate register RDI.
> 
> It might still be profitable to hoist long-latency instructions even in the face of same or slightly worse block frequencies, but that should be decided deliberately according to latency considerations and not accidentally due to numerical inaccuracies.

Thanks @robcasloz! Great that there are cases where the change makes a difference. Additional testing is now complete. I've updated the PR desription. For reference, here is the testing section again.

### Testing

- [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/12181425502)
- `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
- Performance testing using DaCapo, SPECjbb, and SPECjvm on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. No significant improvements nor regressions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22810#issuecomment-2554134945