RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v2]

Mon Mar 31 07:54:14 UTC 2025

On Wed, 26 Mar 2025 08:33:58 GMT, Marc Chevalier <mchevalier at openjdk.org> wrote:

>> `Math.*Exact` intrinsics can cause many deopts when used repeatedly with problematic arguments.
>> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached.
>> 
>> Benchmarks show that this issue affects every `Math.*Exact` function, and this fix improves them all.
>> 
>> tl;dr:
>> - C1: no problem, no change
>> - C2:
>>   - with intrinsics:
>>     - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms)
>>     - without overflow: no problem, no change
>>   - without intrinsics: no problem, no change
>> 
>> Before the fix:
>> 
>> Benchmark                                           (SIZE)  Mode  Cnt     Score      Error  Units
>> MathExact.C1_1.loopAddIInBounds                    1000000  avgt    3     1.272 ±    0.048  ms/op
>> MathExact.C1_1.loopAddIOverflow                    1000000  avgt    3   641.917 ±   58.238  ms/op
>> MathExact.C1_1.loopAddLInBounds                    1000000  avgt    3     1.402 ±    0.842  ms/op
>> MathExact.C1_1.loopAddLOverflow                    1000000  avgt    3   671.013 ±  229.425  ms/op
>> MathExact.C1_1.loopDecrementIInBounds              1000000  avgt    3     3.722 ±   22.244  ms/op
>> MathExact.C1_1.loopDecrementIOverflow              1000000  avgt    3   653.341 ±  279.003  ms/op
>> MathExact.C1_1.loopDecrementLInBounds              1000000  avgt    3     2.525 ±    0.810  ms/op
>> MathExact.C1_1.loopDecrementLOverflow              1000000  avgt    3   656.750 ±  141.792  ms/op
>> MathExact.C1_1.loopIncrementIInBounds              1000000  avgt    3     4.621 ±   12.822  ms/op
>> MathExact.C1_1.loopIncrementIOverflow              1000000  avgt    3   651.608 ±  274.396  ms/op
>> MathExact.C1_1.loopIncrementLInBounds              1000000  avgt    3     2.576 ±    3.316  ms/op
>> MathExact.C1_1.loopIncrementLOverflow              1000000  avgt    3   662.216 ±   71.879  ms/op
>> MathExact.C1_1.loopMultiplyIInBounds               1000000  avgt    3     1.402 ±    0.587  ms/op
>> MathExact.C1_1.loopMultiplyIOverflow               1000000  avgt    3   615.836 ±  252.137  ms/op
>> MathExact.C1_1.loopMultiplyLInBounds               1000000  avgt    3     2.906 ±    5.718  ms/op
>> MathExact.C1_1.loopMultiplyLOverflow               1000000  avgt    3   655.576 ±  147.432  ms/op
>> MathExact.C1_1.loopNegateIInBounds                 1000000  avgt    3     2.023 ±    0.027  ms/op
>> MathExact.C1_1.loopNegateIOverflow                 1000000  avgt    3   639.136 ±   30.841  ms/op
>> MathExact.C1_1.loop...
>
> Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - Use builtin_throw
>  - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code
>  - More exhaustive bench
>  - Limit inlining of math Exact operations in case of too many deopts

Actually, yes, there is a reason I've made it so weird (and I agree it's pretty convoluted).
`builtin_throw` kicks in if `too_many_traps(reason)` is true (and another case, but it might not apply):
https://github.com/openjdk/jdk/blob/59629f88e6fad9c1ff91be4cfea83f78f0ea503c/src/hotspot/share/opto/graphKit.cpp#L540-L555
If `treat_throw_as_hot` is false (so before too many traps), it just ends up as an `uncommon_trap` with the `Action_maybe_recompile` action. That is fine at first. But later, we would like `builtin_throw` to do its job, and it can only do so if
https://github.com/openjdk/jdk/blob/59629f88e6fad9c1ff91be4cfea83f78f0ea503c/src/hotspot/share/opto/graphKit.cpp#L563
which is not the same condition as `too_many_traps(reason)`. This means that:
- if we don't bail out of the intrinsics on `too_many_traps(reason)`, we may end up in the same situation as in the bug, with deopt cycles, whenever `builtin_throw` doesn't do its job (for instance when `method()->can_omit_stack_trace()` is false)
- if we bail out of the intrinsics on `too_many_traps(reason)`, then `builtin_throw` never gets a throw hot enough for it to speed up, and we are back to my first version, before you suggested `builtin_throw` (with similar performance for C2 and C1).
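
For context, the kind of hot throw I have in mind looks roughly like the sketch below. It is only an illustration, not the benchmark from the PR: the class name `ExactOverflowLoop` and the method `countOverflows` are made up for this message. Every call overflows, so every call throws.

```java
// Illustrative sketch only: names are hypothetical, not taken from the PR's benchmark.
class ExactOverflowLoop {
    // Calls Math.addExact with arguments that always overflow and
    // counts how many ArithmeticExceptions were thrown.
    static int countOverflows(int iterations) {
        int overflows = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                // Integer.MAX_VALUE + (i + 1) overflows for every i >= 0.
                Math.addExact(Integer.MAX_VALUE, i + 1);
            } catch (ArithmeticException e) {
                overflows++;
            }
        }
        return overflows;
    }

    public static void main(String[] args) {
        // Every iteration takes the exceptional path.
        System.out.println(countOverflows(1_000_000));
    }
}
```

With a loop like this, the exceptional path is the common path, which is exactly where an `uncommon_trap` per overflow hurts.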

In other words, `too_many_traps(reason)` must be reached before `builtin_throw` has a chance to do something, but it still might not, and in that case we need to bail out of the intrinsics, otherwise we will repeatedly deopt. So, when `too_many_traps(reason)` is true, we have two options: either we hand the throw to `builtin_throw`, or we bail out. To avoid the deopt cycles, we must know in advance whether `builtin_throw` will do its job or just default to an `uncommon_trap` again (in which case bailing out is better). This is why I extracted the condition for `builtin_throw` into `builtin_throw_applies`: so that the intrinsic can decide what is best to do.
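
The decision can be summed up with the small model below. It is a sketch in Java for readability, not the actual HotSpot code (which is C++): the class, the enum and the method `choose` are hypothetical, and the two booleans stand for the results of `too_many_traps(reason)` and `builtin_throw_applies`.

```java
// Hypothetical model of the decision described above; not HotSpot code.
final class IntrinsicChoiceSketch {
    enum Choice {
        INTRINSIC_WITH_UNCOMMON_TRAP,  // overflow path deopts (Action_maybe_recompile)
        INTRINSIC_WITH_BUILTIN_THROW,  // overflow path throws from compiled code
        BAIL_OUT                       // do not intrinsify, fall back to the regular call
    }

    // tooManyTraps models too_many_traps(reason);
    // builtinThrowApplies models the extracted builtin_throw_applies condition.
    static Choice choose(boolean tooManyTraps, boolean builtinThrowApplies) {
        if (!tooManyTraps) {
            // Not hot yet: an uncommon trap is fine at first.
            return Choice.INTRINSIC_WITH_UNCOMMON_TRAP;
        }
        // Hot throw: only keep the intrinsic if builtin_throw will really handle it,
        // otherwise it would emit an uncommon trap again and we would keep deopting.
        return builtinThrowApplies ? Choice.INTRINSIC_WITH_BUILTIN_THROW : Choice.BAIL_OUT;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean traps : values) {
            for (boolean applies : values) {
                System.out.println("tooManyTraps=" + traps
                        + " builtinThrowApplies=" + applies
                        + " -> " + choose(traps, applies));
            }
        }
    }
}
```

The point of the model is the last branch: without knowing `builtinThrowApplies` in advance, the intrinsic cannot tell the two hot cases apart.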

Some of your suggestions are still relevant tho! I'll apply them.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2765414288