RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Tue Feb 18 08:16:19 UTC 2025

On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> @rwestrel @galderz 
>> 
>>> It seems overall, we likely win more than we loose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
>> 
>> I'm a little scared to just accept the regressions, especially for this "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `n / (sum_i 1/i)`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in.
>> 
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
>> 
>> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry.
>> 
>>> Shouldn't int min/max be affected the same way?
>> 
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>> 
>> 
>> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
>> Warmup
>> 5225   93 %     3       TestIntMax::test1 @ 5 (27 bytes)
>> 5226   94       3       TestIntMax::test1 (27 bytes)
>> 5226   95 %     4       TestIntMax::test1 @ 5 (27 bytes)
>> 5238   96       4       TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320  101 %     3       TestIntMax::test2 @ 5 (34 bytes)
>> 6322  102 %     4       TestIntMax::test2 @ 5 (34 bytes)
>> 6329  103       4       TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>> 
>> That's a 4x regression on random input data!
>> 
>> With:
>> 
>> import java.util.Random;
>> 
>> public class TestIntMax {
>>     private static Random RANDOM = new Random();
>> 
>>     public static void main(String[] args) {
>>         int[] a = new int[64 * 1024];
>>         for (int i = 0; i < a.length; i++) {
>>...
>
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
> 
> I observe the same:
> 
> 
> Warmup
> 751    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 360 550 158
> Warmup
> 1862   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 92 116 170
> 
> 
> But then with this:
> 
> 
> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
> index 8cc4a970bfd..9abda8f4178 100644
> --- a/src/hotspot/cpu/x86/x86_64.ad
> +++ b/src/hotspot/cpu/x86/x86_64.ad
> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
>  %}
>  
>  
> -instruct maxI_rReg(rRegI dst, rRegI src)
> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
>  %{
>    match(Set dst (MaxI dst src));
> +  effect(KILL cr);
>  
>    ins_cost(200);
> -  expand %{
> -    rFlagsReg cr;
> -    compI_rReg(cr, dst, src);
> -    cmovI_reg_l(dst, src, cr);
> +  ins_encode %{
> +    Label done;
> +    __ cmpl($src$$Register, $dst$$Register);
> +    __ jccb(Assembler::less, done);
> +    __ mov($dst$$Register, $src$$Register);
> +    __ bind(done);
>    %}
> +  ins_pipe(pipe_cmov_reg);
>  %}
>  
>  // ============================================================================
> 
> 
> the performance gap narrows:
> 
> 
> Warmup
> 770    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 94 951 677
> Warmup
> 1312   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 70 053 824
> 
> 
> (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?

Note something I spoke with @rwestrel yesterday in the context of long min/max vs int min/max.

Int has an ad implementation for min/max whereas long as not. My very first prototype of this issue was to mimmic what int did with log, but talking to @rwestrel we decided it would be better to implement this without introducing platform specific changes.

So, following Roland's thread in https://github.com/openjdk/jdk/pull/20098#issuecomment-2663379660, I could add ad changes for say x86 and aarch64 for long such that it uses branch instead of cmov. Note that the cmov fallback of long min/max comes from macro expansion, not platform specific changes.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664893516