RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Galder Zamarreño galder at openjdk.org
Tue Feb 18 08:20:15 UTC 2025


On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> @rwestrel @galderz 
>> 
>>> It seems overall, we likely win more than we loose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
>> 
>> I'm a little scared to just accept the regressions, especially for this "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `n / (sum_i 1/i)`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in.
>> 
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
>> 
>> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry.
>> 
>>> Shouldn't int min/max be affected the same way?
>> 
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>> 
>> 
>> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
>> Warmup
>> 5225   93 %     3       TestIntMax::test1 @ 5 (27 bytes)
>> 5226   94       3       TestIntMax::test1 (27 bytes)
>> 5226   95 %     4       TestIntMax::test1 @ 5 (27 bytes)
>> 5238   96       4       TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320  101 %     3       TestIntMax::test2 @ 5 (34 bytes)
>> 6322  102 %     4       TestIntMax::test2 @ 5 (34 bytes)
>> 6329  103       4       TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>> 
>> That's a 4x regression on random input data!
>> 
>> With:
>> 
>> import java.util.Random;
>> 
>> public class TestIntMax {
>>     private static Random RANDOM = new Random();
>> 
>>     public static void main(String[] args) {
>>         int[] a = new int[64 * 1024];
>>         for (int i = 0; i < a.length; i++) {
>>...
>
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
> 
> I observe the same:
> 
> 
> Warmup
> 751    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 360 550 158
> Warmup
> 1862   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 92 116 170
> 
> 
> But then with this:
> 
> 
> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
> index 8cc4a970bfd..9abda8f4178 100644
> --- a/src/hotspot/cpu/x86/x86_64.ad
> +++ b/src/hotspot/cpu/x86/x86_64.ad
> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
>  %}
>  
>  
> -instruct maxI_rReg(rRegI dst, rRegI src)
> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
>  %{
>    match(Set dst (MaxI dst src));
> +  effect(KILL cr);
>  
>    ins_cost(200);
> -  expand %{
> -    rFlagsReg cr;
> -    compI_rReg(cr, dst, src);
> -    cmovI_reg_l(dst, src, cr);
> +  ins_encode %{
> +    Label done;
> +    __ cmpl($src$$Register, $dst$$Register);
> +    __ jccb(Assembler::less, done);
> +    __ mov($dst$$Register, $src$$Register);
> +    __ bind(done);
>    %}
> +  ins_pipe(pipe_cmov_reg);
>  %}
>  
>  // ============================================================================
> 
> 
> the performance gap narrows:
> 
> 
> Warmup
> 770    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 94 951 677
> Warmup
> 1312   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 70 053 824
> 
> 
> (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?

To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664903731


More information about the core-libs-dev mailing list