RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Emanuel Peter
epeter at openjdk.org
Thu Feb 27 06:57:04 UTC 2025
On Wed, 26 Feb 2025 18:29:58 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>>> > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this.
>>>
>>> Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example?
>>
>> The probabilities are fine.
>>
>> I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`.
>>
>> In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in.
>>
>> In the slow versions you see this `PrintMethodData`:
>>
>> ```
>> static java.lang.Math::min(II)I
>> interpreter_invocation_count: 18171
>> invocation_counter: 18171
>> backedge_counter: 0
>> decompile_count: 0
>> mdo size: 328 bytes
>>
>> 0 iload_0
>> 1 iload_1
>> 2 if_icmpgt 9
>> 0 bci: 2 BranchData taken(7732) displacement(56)
>> not taken(10180)
>> 5 iload_0
>> 6 goto 10
>> 32 bci: 6 JumpData taken(10180) displacement(24)
>> 9 iload_1
>> 10 ireturn
>> ```
>>
>> ```
>> org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
>> interpreter_invocation_count: 189
>> invocation_counter: 189
>> backedge_counter: 313344
>> decompile_count: 0
>> mdo size: 384 bytes
>>
>> 0 iconst_0
>> 1 istore_2
>> 2 iconst_0
>> 3 istore_3
>> 4 iload_3
>> 5 aload_1
>> 6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I>
>> 9 if_icmpge 33
>> 0 bci: 9 BranchData taken(58) displacement(72)
>> not taken(192512)
>> 12 aload_1
>> 13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I>
>> 16 iload_3
>> 17 iaload
>> 18 istore #4
>> 20 iload_2
>> 21 fast_iload #4
>> 23 invokestatic 32 <java/lang/Math.min(II)I>
>> 32 bci: 23 CounterData count(192512)
>> 26 istore_2
>> 27 iinc #3 1
>> 30 goto 4
>> 48 bci: 30 JumpData taken(192512) displacement(-48)
>> 33 iload_2
>> 34 ireturn
>> ```
>>
>>
>> The benchmark method calls Math...
>
>> > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability.
>> > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here.
>>
>> I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create.
>
> Before noting the regressions, it's worth noting that the PR also improves performance in certain scenarios. I will summarise those tomorrow.
>
> Here's a summary of the regressions
>
> ### Regression 1
> Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when SuperWord finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress.
>
> Possible solutions:
> a) make SuperWord recognise these scenarios as profitable.
>
> ### Regression 2
> Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when the platform does not support the vector instructions to achieve this (e.g. AVX-512 quad-word vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress.
>
> Possible solutions:
> a) find a way to use other vector instructions (vpcmp+vpblend+vmov?)
> b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided
>
> ### Regression 3
> Given a loop with a long min/max non-reduction pattern (e.g. `longLoopMax`) with one side of the branch taken near 100% of the time, when the platform does not vectorize it (either due to lack of CPU instruction support, or SuperWord finding it not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress.
>
> Possible solutions:
> a) find a way to use other vector instructions (e.g. `longLoopMax` vectorizes with AVX2 and might also do with earlier instruction sets)
> b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided.
@galderz Thanks for the summary of regressions! Yes, there are plenty of speedups, I assume primarily because of `Long.min/max` vectorization, but possibly also because the operation can now "float" out of a loop, for example.
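For context, a minimal sketch (not code from the PR or its benchmarks) of the kind of loop that benefits: a `Long.min` reduction, which the new `Math.min(long,long)` intrinsic exposes to C2 as a node that SuperWord can vectorize on platforms with 64-bit vector min support:

```java
public class LongMinReduction {
    // A simple long min reduction. With the Math.min(long,long) intrinsic,
    // C2 can recognize the reduction and, where the hardware supports it
    // (e.g. AVX-512 vpminsq), SuperWord can turn it into vector min ops.
    static long minReduction(long[] a) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            min = Math.min(min, a[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        long[] data = {7, -3, 42, 0};
        System.out.println(minReduction(data)); // prints -3
    }
}
```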
All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%); you listed no others. That matches my intuition that branching code is usually better than cmove in extreme-probability cases.
As for possible solutions: in all Regression 1-3 cases, the issue seems to be scalar cmove, so in all cases a possible solution is using branching code (i.e. `cmp+mov`). To me, these are the follow-up RFEs:
- Detect "extreme"-probability scalar cmoves, and replace them with branching code. This should take care of all the regressions here. This one has high priority, as it fixes the regressions caused by this patch. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue.
- Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional.
- Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional.
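To illustrate the intuition behind the first RFE (a hypothetical source-level sketch, not C2 IR): a cmove executes unconditionally and sits on the loop-carried dependency chain, whereas a branch that goes one way near 100% of the time is almost always predicted correctly, so branchy code tends to win in the extreme-probability cases:

```java
public class MinSketch {
    // Branchy min: with a near-100%/0% branch, the predictor makes the
    // compare essentially free, and the update path is rarely executed.
    static long branchyMin(long[] a) {
        long min = Long.MAX_VALUE;
        for (long v : a) {
            if (v < min) {   // rarely taken once min settles
                min = v;
            }
        }
        return min;
    }

    // Math.min variant: C2 may lower this to a conditional move, which
    // always executes and lengthens the dependency chain through `min`.
    static long cmovStyleMin(long[] a) {
        long min = Long.MAX_VALUE;
        for (long v : a) {
            min = Math.min(min, v);
        }
        return min;
    }

    public static void main(String[] args) {
        long[] data = {5, 3, 8, 1, 9};
        System.out.println(branchyMin(data));   // prints 1
        System.out.println(cmovStyleMin(data)); // prints 1
    }
}
```

Both variants compute the same result; the difference is only in how C2 lowers the conditional, which is exactly what the proposed heuristic would decide based on the observed branch probability.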
Does that make sense, or am I missing something?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2687067125
More information about the core-libs-dev mailing list