RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
Emanuel Peter
epeter at openjdk.org
Tue Feb 18 08:46:17 UTC 2025
On Tue, 18 Feb 2025 08:17:59 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>>
>> I observe the same:
>>
>>
>> Warmup
>> 751 3 b TestIntMax::test1 (27 bytes)
>> Run
>> Time: 360 550 158
>> Warmup
>> 1862 15 b TestIntMax::test2 (34 bytes)
>> Run
>> Time: 92 116 170
>>
>>
>> But then with this:
>>
>>
>> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
>> index 8cc4a970bfd..9abda8f4178 100644
>> --- a/src/hotspot/cpu/x86/x86_64.ad
>> +++ b/src/hotspot/cpu/x86/x86_64.ad
>> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
>> %}
>>
>>
>> -instruct maxI_rReg(rRegI dst, rRegI src)
>> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
>> %{
>> match(Set dst (MaxI dst src));
>> + effect(KILL cr);
>>
>> ins_cost(200);
>> - expand %{
>> - rFlagsReg cr;
>> - compI_rReg(cr, dst, src);
>> - cmovI_reg_l(dst, src, cr);
>> + ins_encode %{
>> + Label done;
>> + __ cmpl($src$$Register, $dst$$Register);
>> + __ jccb(Assembler::less, done);
>> + __ mov($dst$$Register, $src$$Register);
>> + __ bind(done);
>> %}
>> + ins_pipe(pipe_cmov_reg);
>> %}
>>
>> // ============================================================================
>>
>>
>> the performance gap narrows:
>>
>>
>> Warmup
>> 770 3 b TestIntMax::test1 (27 bytes)
>> Run
>> Time: 94 951 677
>> Warmup
>> 1312 15 b TestIntMax::test2 (34 bytes)
>> Run
>> Time: 70 053 824
>>
>>
>> (the numbers for test2 fluctuate quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?
>
> To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have.
@galderz I think we will have the same issue with both `int` and `long`: as far as I know, it is a genuinely hard problem to decide at compile time whether a `cmove` or a `branch` is the better choice. I doubt there is any heuristic for which you cannot find a micro-benchmark where that heuristic makes the wrong choice.
To my understanding, these are the factors that impact the performance:
- `cmove` requires all of its inputs to complete before it can execute, and it has an inherent latency of a cycle or so itself. But it cannot be mispredicted, so there are no branch misprediction penalties (i.e. the CPU never has to flush the ops from the wrong path and restart at the branch).
- `branch` can hide some latency, because execution continues along the predicted path: we do not need to wait for the inputs of the comparison to arrive, and we can already continue with the speculated result value. But whenever the speculation is wrong, we pay the misprediction penalty.
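To make the two machine-code shapes concrete, here is a minimal Java sketch (class and method names are mine, not from the PR): a branchy max and a branchless max that mimics what a `cmove` computes. The branchless variant uses the overflow-safe signed-comparison bit trick so it stays correct across the full `long` range.

```java
public class MaxShapes {
    // Branchy shape: cheap when the branch predictor guesses right,
    // pays a flush-and-restart penalty when it does not.
    static long maxBranchy(long a, long b) {
        if (a >= b) return a;
        return b;
    }

    // Branchless shape, mimicking a cmove: both inputs must be ready
    // before the result is, but there is nothing to mispredict.
    // (a - b) can overflow when the signs of a and b differ, so the
    // "a < b" bit is taken from the sign of a in that case.
    static long maxBranchless(long a, long b) {
        long d = a - b;
        long lt = (d ^ ((a ^ b) & (d ^ a))) >>> 63; // 1 if a < b, else 0
        long mask = -lt;                            // all-ones if a < b
        return (b & mask) | (a & ~mask);
    }

    public static void main(String[] args) {
        long[][] cases = {
            {3, 5}, {5, 3}, {-1, 1}, {7, 7},
            {Long.MIN_VALUE, Long.MAX_VALUE}, {Long.MAX_VALUE, Long.MIN_VALUE}
        };
        for (long[] c : cases) {
            long expect = Math.max(c[0], c[1]);
            if (maxBranchy(c[0], c[1]) != expect) throw new AssertionError("branchy");
            if (maxBranchless(c[0], c[1]) != expect) throw new AssertionError("branchless");
        }
        System.out.println("ok");
    }
}
```

Note that this is only an illustration of the data-flow difference; at the Java level you cannot force C2 to pick one shape or the other.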
In my understanding, there are roughly 3 scenarios:
- The branch probability is so extreme that the branch predictor is correct almost always, so branching code is profitable.
- The branch probability is somewhere in the middle and the branch is not predictable. Mispredictions are very expensive, so it is better to use a `cmove`.
- The branch probability is somewhere in the middle, but the branch is predictable (e.g. it swaps back and forth). The branch predictor has almost no mispredictions, and branching code is faster.
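A tiny harness (entirely hypothetical, names and timing approach are mine) can produce exactly these three scenarios: an extreme branch probability, an unpredictable random 50/50 pattern, and a perfectly predictable alternating pattern that has the same 50/50 probability as the random one. Profiling sees identical branch probabilities for the last two, yet a real branch predictor handles them very differently.

```java
import java.util.Random;
import java.util.function.IntToLongFunction;

public class BranchPatterns {
    static final int N = 1_000_000;

    // Sum of max(v, 0): the comparison inside is the branch of interest.
    static long sumMax(long[] a) {
        long s = 0;
        for (long v : a) {
            s += (v > 0) ? v : 0;
        }
        return s;
    }

    static long[] fill(IntToLongFunction f) {
        long[] a = new long[N];
        for (int i = 0; i < N; i++) a[i] = f.applyAsLong(i);
        return a;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        long[] extreme = fill(i -> 1L);                         // branch taken ~100%
        long[] unpred  = fill(i -> r.nextBoolean() ? 1L : -1L); // 50/50, random
        long[] alt     = fill(i -> (i % 2 == 0) ? 1L : -1L);    // 50/50, predictable

        for (long[] a : new long[][] { extreme, unpred, alt }) {
            long t0 = System.nanoTime();
            long s = sumMax(a);
            long t1 = System.nanoTime();
            System.out.println("sum=" + s + " time=" + (t1 - t0) + "ns");
        }
    }
}
```

A naive `System.nanoTime()` loop like this is only a sketch; a proper measurement would use JMH with warmup iterations, as the quoted logs above do implicitly by waiting for C2 compilation.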
Modeling this precisely is actually quite complex. You would have to know the cost of both the `cmove` and the `branching` version of the code, which depends on the latencies of the inputs and outputs: does the `cmove` dramatically increase the latency on the critical path, and could `branching` hide some of that latency? You would also have to know how good the branch predictor is, which you cannot derive from the branch probability in our profiling (at least not when the probability is in the middle and you do not know whether the pattern is random or predictable).
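As a back-of-envelope illustration of why the middle ground is where the heuristic matters, here is a deliberately crude cost model. All constants are invented for illustration; none of them come from C2 or from any real CPU.

```java
public class CmoveCostModel {
    // Invented constants, purely illustrative.
    static final double PREDICTED_BRANCH_COST = 1.0;  // cycles when predicted right
    static final double MISPREDICT_PENALTY    = 15.0; // cycles to flush and restart
    static final double CMOVE_COST            = 3.0;  // latency + waiting on both inputs

    // Expected per-iteration cost of the branchy shape, given the rate at
    // which the predictor is actually wrong -- which is exactly the number
    // that profiling cannot tell us.
    static double branchCost(double mispredictRate) {
        return PREDICTED_BRANCH_COST + mispredictRate * MISPREDICT_PENALTY;
    }

    static boolean preferCmove(double mispredictRate) {
        return branchCost(mispredictRate) > CMOVE_COST;
    }

    public static void main(String[] args) {
        // Break-even near (3.0 - 1.0) / 15.0, i.e. ~13% mispredictions, in this toy model.
        System.out.println(preferCmove(0.01)); // well-predicted branch -> false
        System.out.println(preferCmove(0.50)); // coin-flip branch -> true
    }
}
```

The point of the sketch is the shape of the problem, not the numbers: the decision hinges on the misprediction *rate*, while profiling only records the branch *probability*, and the two diverge precisely for predictable mid-range patterns.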
If we can find a perfect heuristic - that would be fantastic ;)
If we cannot find a perfect heuristic, then I think we should focus on the most "common" or "relevant" scenarios.
But let's discuss all of this in a call / offline :)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664956307
More information about the core-libs-dev mailing list