RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
Emanuel Peter
epeter at openjdk.org
Mon Feb 17 10:39:15 UTC 2025
On Mon, 17 Feb 2025 08:44:46 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>>> I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible.
>>
>> That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node.
>
> Possibly. We could also create the intrinsic the way it's done in the patch and extract the frequency from the `MethodData` for the min or max methods. The shape of the bytecodes for these methods should be simple enough that it should be feasible.
@rwestrel @galderz
> It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
I'm a little scared to just accept the regressions, especially for this "most average-looking" case:
Imagine you have an array with random numbers, or at least numbers in a random order. If we take the max, then we expect the first number to be a new max with probability 1, the second with probability 1/2, the third with 1/3, and the i'th with 1/i. So the average branch probability is `(sum_i 1/i) / n`, which gets closer and closer to zero the larger the array. This means that the "average" case has an extreme branch probability. And so if we do not vectorize, the current patch gives us a regression. And vectorization is a little fragile; it only takes very little for vectorization not to kick in.
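As a rough illustration of that argument (a hypothetical sketch, not part of the benchmark below; class and variable names are made up), one can simply count how often the "new max" branch is actually taken on random input. The count grows like the harmonic number H_n, so the average branch probability H_n / n shrinks as the array grows:

import java.util.Random;

// Hypothetical sketch: count how often a running max is updated on random data.
// For n random values the expected number of updates is H_n = 1 + 1/2 + ... + 1/n,
// so the average branch probability H_n / n goes towards zero as n grows.
public class MaxBranchFrequency {
    public static void main(String[] args) {
        Random random = new Random();
        int n = 64 * 1024;
        int[] a = new int[n];
        for (int i = 0; i < n; i++) { a[i] = random.nextInt(); }

        int x = Integer.MIN_VALUE;
        long taken = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > x) {   // the branch the Max node replaces
                x = a[i];
                taken++;
            }
        }
        // For n = 65536, H_n is roughly ln(n) + 0.577 ~ 11.7, so the branch is
        // taken in only about 0.02% of the iterations.
        System.out.println("branch taken: " + taken + " of " + n
                + " (" + (100.0 * taken / n) + "%)");
    }
}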
> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
I suppose we could write an optimization that hoists loop-invariant if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop and paste it before the loop entry.
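For illustration, a hypothetical source-level sketch of that transformation (method names and shapes made up, not from the patch): in `sumBefore` the condition and both inputs of the diamond are loop invariant, so it can be cut out of the loop and pasted before the loop entry, as in `sumAfter`:

// Before: the diamond (lo < hi) ? hi : lo is recomputed on every iteration,
// even though the condition and both values are loop invariant.
static long sumBefore(long[] a, long lo, long hi) {
    long sum = 0;
    for (int i = 0; i < a.length; i++) {
        long bound = (lo < hi) ? hi : lo;   // loop-invariant diamond
        sum += Math.min(a[i], bound);
    }
    return sum;
}

// After: the diamond is hoisted before the loop entry and computed once.
static long sumAfter(long[] a, long lo, long hi) {
    long bound = (lo < hi) ? hi : lo;       // hoisted out of the loop
    long sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += Math.min(a[i], bound);
    }
    return sum;
}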
> Shouldn't int min/max be affected the same way?
I think we should be able to see the same issue here, actually. Yes. Here is a quick benchmark:
java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
CompileCommand: compileonly TestIntMax.test* bool compileonly = true
CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
Warmup
5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
5226 94 3 TestIntMax::test1 (27 bytes)
5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
5238 96 4 TestIntMax::test1 (27 bytes)
Run
Time: 542056319
Warmup
6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
6329 103 4 TestIntMax::test2 (34 bytes)
Run
Time: 166815209
That's more than a 3x regression on random input data!
With:
import java.util.Random;
public class TestIntMax {
    private static Random RANDOM = new Random();

    public static void main(String[] args) {
        int[] a = new int[64 * 1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = RANDOM.nextInt();
        }
        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++) { test1(a); }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) { test1(a); }
            long t1 = System.nanoTime();
            System.out.println("Time: " + (t1 - t0));
        }
        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++) { test2(a); }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) { test2(a); }
            long t1 = System.nanoTime();
            System.out.println("Time: " + (t1 - t0));
        }
    }

    public static int test1(int[] a) {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            x = Math.max(x, a[i]);
        }
        return x;
    }

    public static int test2(int[] a) {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            x = (x >= a[i]) ? x : a[i];
        }
        return x;
    }
}
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662706564