RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Mon Feb 17 16:49:15 UTC 2025

On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> @rwestrel @galderz 
>> 
>>> It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
>> 
>> I'm a little scared to just accept the regressions, especially for this "most average looking case":
>> Imagine you have an array with random numbers, or at least numbers in a random order. If we take the max, then we expect the first number to be a new max with probability 1, the second with probability 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This gets closer and closer to zero the larger the array. This means that the "average" case has an extreme probability. So if we do not vectorize, we get a regression with the current patch. And vectorization is a little fragile: it takes very little for vectorization not to kick in.
>> 
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
>> 
>> I suppose we could write an optimization that hoists loop-independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry.
>> 
>>> Shouldn't int min/max be affected the same way?
>> 
>> I think we should be able to see the same issue here, actually. Yes. Here is a quick benchmark:
>> 
>> 
>> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
>> Warmup
>> 5225   93 %     3       TestIntMax::test1 @ 5 (27 bytes)
>> 5226   94       3       TestIntMax::test1 (27 bytes)
>> 5226   95 %     4       TestIntMax::test1 @ 5 (27 bytes)
>> 5238   96       4       TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320  101 %     3       TestIntMax::test2 @ 5 (34 bytes)
>> 6322  102 %     4       TestIntMax::test2 @ 5 (34 bytes)
>> 6329  103       4       TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>> 
>> That's a more than 3x regression on random input data!
>> 
>> With:
>> 
>> import java.util.Random;
>> 
>> public class TestIntMax {
>>     private static Random RANDOM = new Random();
>> 
>>     public static void main(String[] args) {
>>         int[] a = new int[64 * 1024];
>>         for (int i = 0; i < a.length; i++) {
>>...
>
>> I think we should be able to see the same issue here, actually. Yes. Here is a quick benchmark:
> 
> I observe the same:
> 
> 
> Warmup
> 751    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 360 550 158
> Warmup
> 1862   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 92 116 170
> 
> 
> But then with this:
> 
> 
> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
> index 8cc4a970bfd..9abda8f4178 100644
> --- a/src/hotspot/cpu/x86/x86_64.ad
> +++ b/src/hotspot/cpu/x86/x86_64.ad
> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
>  %}
>  
>  
> -instruct maxI_rReg(rRegI dst, rRegI src)
> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
>  %{
>    match(Set dst (MaxI dst src));
> +  effect(KILL cr);
>  
>    ins_cost(200);
> -  expand %{
> -    rFlagsReg cr;
> -    compI_rReg(cr, dst, src);
> -    cmovI_reg_l(dst, src, cr);
> +  ins_encode %{
> +    Label done;
> +    __ cmpl($src$$Register, $dst$$Register);
> +    __ jccb(Assembler::less, done);
> +    __ mov($dst$$Register, $src$$Register);
> +    __ bind(done);
>    %}
> +  ins_pipe(pipe_cmov_reg);
>  %}
>  
>  // ============================================================================
> 
> 
> the performance gap narrows:
> 
> 
> Warmup
> 770    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 94 951 677
> Warmup
> 1312   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 70 053 824
> 
> 
> (the test2 number fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?

@rwestrel @eme64 I think that the data distribution in the `TestIntMax` above matters (see my explanations in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've enhanced the test to control the data distribution in the int[] (see the listing after the results).
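
As a rough illustration of why the distribution matters for the original random-data benchmark: if the values are in uniformly random order with no ties, the i'th element is a new maximum with probability 1/i, so the expected number of updates per pass is the harmonic number H_n, and the average update probability is H_n / n. The sketch below computes both for the 64 * 1024 array length. The class and method names are hypothetical and not part of the PR or of `TestIntMax`.

```java
// Hypothetical helper for illustration only; not part of the PR.
class ExpectedMaxUpdates
{
    // Expected number of times a running maximum changes over n values
    // in uniformly random order (assumes no ties): H_n = 1/1 + ... + 1/n.
    static double expectedUpdates(int n)
    {
        double sum = 0.0;
        for (int i = 1; i <= n; i++)
        {
            sum += 1.0 / i;
        }
        return sum;
    }

    public static void main(String[] args)
    {
        int n = 64 * 1024; // array length used by TestIntMax
        double updates = expectedUpdates(n);
        System.out.printf("Expected updates per pass: %.2f%n", updates);
        System.out.printf("Average update probability: %.6f%n", updates / n);
    }
}
```

For n = 64 * 1024 this comes out at under 12 expected updates per pass, so under these assumptions the "update" branch is almost never taken.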

Here are the results I see on my AVX-512 machine:

Probability: 50%
Warmup
7834   92 %  b  3       TestIntMax::test1 @ 5 (27 bytes)
7836   93    b  3       TestIntMax::test1 (27 bytes)
7838   94 %  b  4       TestIntMax::test1 @ 5 (27 bytes)
7851   95    b  4       TestIntMax::test1 (27 bytes)
Run
Time: 699 923 014
Warmup
9272   96 %  b  3       TestIntMax::test2 @ 5 (34 bytes)
9274   97    b  3       TestIntMax::test2 (34 bytes)
9275   98 %  b  4       TestIntMax::test2 @ 5 (34 bytes)
9287   99    b  4       TestIntMax::test2 (34 bytes)
Run
Time: 699 815 792

Probability: 80%
Warmup
7872   92 %  b  3       TestIntMax::test1 @ 5 (27 bytes)
7874   93    b  3       TestIntMax::test1 (27 bytes)
7875   94 %  b  4       TestIntMax::test1 @ 5 (27 bytes)
7889   95    b  4       TestIntMax::test1 (27 bytes)
Run
Time: 699 947 633
Warmup
9310   96 %  b  3       TestIntMax::test2 @ 5 (34 bytes)
9311   97    b  3       TestIntMax::test2 (34 bytes)
9312   98 %  b  4       TestIntMax::test2 @ 5 (34 bytes)
9325   99    b  4       TestIntMax::test2 (34 bytes)
Run
Time: 699 827 882

Probability: 100%
Warmup
7884   92 %  b  3       TestIntMax::test1 @ 5 (27 bytes)
7886   93    b  3       TestIntMax::test1 (27 bytes)
7888   94 %  b  4       TestIntMax::test1 @ 5 (27 bytes)
7901   95    b  4       TestIntMax::test1 (27 bytes)
Run
Time: 699 931 243
Warmup
9322   96 %  b  3       TestIntMax::test2 @ 5 (34 bytes)
9323   97    b  3       TestIntMax::test2 (34 bytes)
9324   98 %  b  4       TestIntMax::test2 @ 5 (34 bytes)
9336   99    b  4       TestIntMax::test2 (34 bytes)
Run
Time: 1 077 937 282

import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;

class TestIntMax
{
    static final int RANGE = 16 * 1024;
    static final int ITER = 100_000;

    public static void main(String[] args)
    {
        final int probability = Integer.parseInt(args[0]);

        final DecimalFormatSymbols symbols = new DecimalFormatSymbols();
        symbols.setGroupingSeparator(' ');
        final DecimalFormat format = new DecimalFormat("#,###", symbols);

        System.out.printf("Probability: %d%%%n", probability);
        int[] a = new int[64 * 1024];
        init(a, probability);

        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++)
            {
                test1(a);
            }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++)
            {
                test1(a);
            }
            long t1 = System.nanoTime();
            System.out.println("Time: " + format.format(t1 - t0));
        }

        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++)
            {
                test2(a);
            }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++)
            {
                test2(a);
            }
            long t1 = System.nanoTime();
            System.out.println("Time: " + format.format(t1 - t0));
        }
    }

    public static int test1(int[] a)
    {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++)
        {
            x = Math.max(x, a[i]);
        }
        return x;
    }

    public static int test2(int[] a)
    {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++)
        {
            x = (x >= a[i]) ? x : a[i];
        }
        return x;
    }

    public static void init(int[] ints, int probability)
    {
        int aboveCount, abovePercent;

        do
        {
            int max = ThreadLocalRandom.current().nextInt(10);
            ints[0] = max;

            aboveCount = 0;
            for (int i = 1; i < ints.length; i++)
            {
                int value;
                if (ThreadLocalRandom.current().nextInt(101) <= probability)
                {
                    int increment = ThreadLocalRandom.current().nextInt(10);
                    value = max + increment;
                    aboveCount++;
                }
                else
                {
                    // Decrement by at least 1
                    int decrement = ThreadLocalRandom.current().nextInt(10) + 1;
                    value = max - decrement;
                }
                ints[i] = value;
                max = Math.max(max, value);
            }

            abovePercent = ((aboveCount + 1) * 100) / ints.length;
        } while (abovePercent != probability);
    }
}
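
One way to double-check what `init` produces is to count how often the running maximum actually changes. This matters because an "above" element in the listing can have an increment of 0, in which case it equals the running maximum and `test2` keeps `x`. The following is a minimal sketch under that reading of the listing; `UpdateFraction` and `strictUpdateFraction` are hypothetical names, not part of the test.

```java
// Hypothetical helper for illustration only; not part of TestIntMax.
class UpdateFraction
{
    // Fraction of elements that are strictly greater than the running
    // maximum, i.e. the iterations in which test2 would replace x.
    static double strictUpdateFraction(int[] a)
    {
        int x = Integer.MIN_VALUE;
        int updates = 0;
        for (int v : a)
        {
            if (v > x)
            {
                x = v;
                updates++;
            }
        }
        return (double) updates / a.length;
    }

    public static void main(String[] args)
    {
        System.out.println(strictUpdateFraction(new int[] {1, 2, 3, 4})); // every element is a new max
        System.out.println(strictUpdateFraction(new int[] {4, 3, 2, 1})); // only the first element is
        System.out.println(strictUpdateFraction(new int[] {1, 1, 1, 1})); // ties do not count as updates
    }
}
```

Calling `strictUpdateFraction(a)` right after `init(a, probability)` would show how close the strict update rate is to the requested probability.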

I'm focusing my comments below on the 100% case, which is where the differences appear:

test2 (100%):

 ;; B12: #	out( B21 B13 ) <- in( B11 B20 )  Freq: 1.6744e+09
  0x00007f15bcada2e9:   movl		0x14(%rsi, %rdx, 4), %r11d
                                                            ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMax::test2 at 14 (line 71)
  0x00007f15bcada2ee:   cmpl		%r11d, %r10d
  0x00007f15bcada2f1:   jge		0x7f15bcada362      ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMax::test2 at 25 (line 71)

test1 (100%):

 ;; B10: #	out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09
  0x00007f15bcad9a70:   movl		0x4c(%rsi, %rdx, 4), %r11d
  0x00007f15bcad9a75:   movl		%r11d, (%rsp)
  0x00007f15bcad9a79:   movl		0x48(%rsi, %rdx, 4), %r10d
  0x00007f15bcad9a7e:   movl		%r10d, 4(%rsp)
  0x00007f15bcad9a83:   movl		0x10(%rsi, %rdx, 4), %r11d
  0x00007f15bcad9a88:   movl		0x14(%rsi, %rdx, 4), %r9d
  0x00007f15bcad9a8d:   movl		0x44(%rsi, %rdx, 4), %r10d
  0x00007f15bcad9a92:   movl		%r10d, 8(%rsp)
  0x00007f15bcad9a97:   movl		0x18(%rsi, %rdx, 4), %r8d
  0x00007f15bcad9a9c:   cmpl		%r11d, %eax
  0x00007f15bcad9a9f:   cmovll		%r11d, %eax
  0x00007f15bcad9aa3:   cmpl		%r9d, %eax
  0x00007f15bcad9aa6:   cmovll		%r9d, %eax
  0x00007f15bcad9aaa:   movl		0x20(%rsi, %rdx, 4), %r10d
  0x00007f15bcad9aaf:   cmpl		%r8d, %eax
  0x00007f15bcad9ab2:   cmovll		%r8d, %eax
  0x00007f15bcad9ab6:   movl		0x24(%rsi, %rdx, 4), %r8d
  0x00007f15bcad9abb:   movl		0x28(%rsi, %rdx, 4), %r11d
                                                            ;   {no_reloc}
  0x00007f15bcad9ac0:   movl		0x2c(%rsi, %rdx, 4), %ecx
  0x00007f15bcad9ac4:   movl		0x30(%rsi, %rdx, 4), %r9d
  0x00007f15bcad9ac9:   movl		0x34(%rsi, %rdx, 4), %edi
  0x00007f15bcad9acd:   movl		0x38(%rsi, %rdx, 4), %ebx
  0x00007f15bcad9ad1:   movl		0x3c(%rsi, %rdx, 4), %ebp
  0x00007f15bcad9ad5:   movl		0x40(%rsi, %rdx, 4), %r13d
  0x00007f15bcad9ada:   movl		0x1c(%rsi, %rdx, 4), %r14d
  0x00007f15bcad9adf:   cmpl		%r14d, %eax
  0x00007f15bcad9ae2:   cmovll		%r14d, %eax
  0x00007f15bcad9ae6:   cmpl		%r10d, %eax
  0x00007f15bcad9ae9:   cmovll		%r10d, %eax
  0x00007f15bcad9aed:   cmpl		%r8d, %eax
  0x00007f15bcad9af0:   cmovll		%r8d, %eax
  0x00007f15bcad9af4:   cmpl		%r11d, %eax
  0x00007f15bcad9af7:   cmovll		%r11d, %eax
  0x00007f15bcad9afb:   cmpl		%ecx, %eax
  0x00007f15bcad9afd:   cmovll		%ecx, %eax
  0x00007f15bcad9b00:   cmpl		%r9d, %eax
  0x00007f15bcad9b03:   cmovll		%r9d, %eax
  0x00007f15bcad9b07:   cmpl		%edi, %eax
  0x00007f15bcad9b09:   cmovll		%edi, %eax
  0x00007f15bcad9b0c:   cmpl		%ebx, %eax
  0x00007f15bcad9b0e:   cmovll		%ebx, %eax
  0x00007f15bcad9b11:   cmpl		%ebp, %eax
  0x00007f15bcad9b13:   cmovll		%ebp, %eax
  0x00007f15bcad9b16:   cmpl		%r13d, %eax
  0x00007f15bcad9b19:   cmovll		%r13d, %eax
  0x00007f15bcad9b1d:   cmpl		8(%rsp), %eax
  0x00007f15bcad9b21:   movl		8(%rsp), %r11d
  0x00007f15bcad9b26:   cmovll		%r11d, %eax
  0x00007f15bcad9b2a:   cmpl		4(%rsp), %eax
  0x00007f15bcad9b2e:   movl		4(%rsp), %r10d
  0x00007f15bcad9b33:   cmovll		%r10d, %eax
  0x00007f15bcad9b37:   cmpl		(%rsp), %eax
  0x00007f15bcad9b3a:   movl		(%rsp), %r11d
  0x00007f15bcad9b3e:   cmovll		%r11d, %eax         ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMax::test1 at 15 (line 61)
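
For scale, the 100% timings above can be converted to a per-element cost: each timed run makes 10_000 calls over a 64 * 1024 element array. The sketch below does that arithmetic; `PerElementCost` and `nanosPerElement` are hypothetical names, and the inputs are simply the times reported above.

```java
// Hypothetical helper for illustration only; it just divides the
// reported run times by the number of elements processed.
class PerElementCost
{
    static double nanosPerElement(long totalNanos, long calls, long arrayLength)
    {
        return (double) totalNanos / (calls * arrayLength);
    }

    public static void main(String[] args)
    {
        long calls = 10_000;     // timed iterations in main
        long length = 64 * 1024; // array length
        double test1 = nanosPerElement(699_931_243L, calls, length);
        double test2 = nanosPerElement(1_077_937_282L, calls, length);
        System.out.printf("test1 (100%%): %.3f ns/element%n", test1);
        System.out.printf("test2 (100%%): %.3f ns/element%n", test2);
        System.out.printf("test2 / test1: %.2f%n", test2 / test1);
    }
}
```

That works out to roughly 1.07 ns per element for test1 and roughly 1.64 ns per element for test2, a ratio of about 1.5.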

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050