RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
Galder Zamarreño
galder at openjdk.org
Mon Feb 17 16:49:15 UTC 2025
On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>> @rwestrel @galderz
>>
>>> It seems overall, we likely win more than we loose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
>>
>> I'm a little scared to just accept the regressions, especially for this "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `n / (sum_i 1/i)`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in.
>>
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
>>
>> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry.
>>
>>> Shouldn't int min/max be affected the same way?
>>
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>>
>>
>> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
>> Warmup
>> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
>> 5226 94 3 TestIntMax::test1 (27 bytes)
>> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
>> 5238 96 4 TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
>> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
>> 6329 103 4 TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>>
>> That's a 4x regression on random input data!
>>
>> With:
>>
>> import java.util.Random;
>>
>> public class TestIntMax {
>> private static Random RANDOM = new Random();
>>
>> public static void main(String[] args) {
>> int[] a = new int[64 * 1024];
>> for (int i = 0; i < a.length; i++) {
>>...
>
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>
> I observe the same:
>
>
> Warmup
> 751 3 b TestIntMax::test1 (27 bytes)
> Run
> Time: 360 550 158
> Warmup
> 1862 15 b TestIntMax::test2 (34 bytes)
> Run
> Time: 92 116 170
>
>
> But then with this:
>
>
> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
> index 8cc4a970bfd..9abda8f4178 100644
> --- a/src/hotspot/cpu/x86/x86_64.ad
> +++ b/src/hotspot/cpu/x86/x86_64.ad
> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
> %}
>
>
> -instruct maxI_rReg(rRegI dst, rRegI src)
> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
> %{
> match(Set dst (MaxI dst src));
> + effect(KILL cr);
>
> ins_cost(200);
> - expand %{
> - rFlagsReg cr;
> - compI_rReg(cr, dst, src);
> - cmovI_reg_l(dst, src, cr);
> + ins_encode %{
> + Label done;
> + __ cmpl($src$$Register, $dst$$Register);
> + __ jccb(Assembler::less, done);
> + __ mov($dst$$Register, $src$$Register);
> + __ bind(done);
> %}
> + ins_pipe(pipe_cmov_reg);
> %}
>
> // ============================================================================
>
>
> the performance gap narrows:
>
>
> Warmup
> 770 3 b TestIntMax::test1 (27 bytes)
> Run
> Time: 94 951 677
> Warmup
> 1312 15 b TestIntMax::test2 (34 bytes)
> Run
> Time: 70 053 824
>
>
> (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?
@rwestrel @eme64 I think that the data distribution in the `TestIntMax` above matters (see my explanations in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've enhanced the test to control data distribution in the int[] (see at the bottom).
Here are the results I see on my AVX-512 machine:
Probability: 50%
Warmup
7834 92 % b 3 TestIntMax::test1 @ 5 (27 bytes)
7836 93 b 3 TestIntMax::test1 (27 bytes)
7838 94 % b 4 TestIntMax::test1 @ 5 (27 bytes)
7851 95 b 4 TestIntMax::test1 (27 bytes)
Run
Time: 699 923 014
Warmup
9272 96 % b 3 TestIntMax::test2 @ 5 (34 bytes)
9274 97 b 3 TestIntMax::test2 (34 bytes)
9275 98 % b 4 TestIntMax::test2 @ 5 (34 bytes)
9287 99 b 4 TestIntMax::test2 (34 bytes)
Run
Time: 699 815 792
Probability: 80%
Warmup
7872 92 % b 3 TestIntMax::test1 @ 5 (27 bytes)
7874 93 b 3 TestIntMax::test1 (27 bytes)
7875 94 % b 4 TestIntMax::test1 @ 5 (27 bytes)
7889 95 b 4 TestIntMax::test1 (27 bytes)
Run
Time: 699 947 633
Warmup
9310 96 % b 3 TestIntMax::test2 @ 5 (34 bytes)
9311 97 b 3 TestIntMax::test2 (34 bytes)
9312 98 % b 4 TestIntMax::test2 @ 5 (34 bytes)
9325 99 b 4 TestIntMax::test2 (34 bytes)
Run
Time: 699 827 882
Probability: 100%
Warmup
7884 92 % b 3 TestIntMax::test1 @ 5 (27 bytes)
7886 93 b 3 TestIntMax::test1 (27 bytes)
7888 94 % b 4 TestIntMax::test1 @ 5 (27 bytes)
7901 95 b 4 TestIntMax::test1 (27 bytes)
Run
Time: 699 931 243
Warmup
9322 96 % b 3 TestIntMax::test2 @ 5 (34 bytes)
9323 97 b 3 TestIntMax::test2 (34 bytes)
9324 98 % b 4 TestIntMax::test2 @ 5 (34 bytes)
9336 99 b 4 TestIntMax::test2 (34 bytes)
Run
Time: 1 077 937 282
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
class TestIntMax
{
static final int RANGE = 16 * 1024;
static final int ITER = 100_000;
public static void main(String[] args)
{
final int probability = Integer.parseInt(args[0]);
final DecimalFormatSymbols symbols = new DecimalFormatSymbols();
symbols.setGroupingSeparator(' ');
final DecimalFormat format = new DecimalFormat("#,###", symbols);
System.out.printf("Probability: %d%%%n", probability);
int[] a = new int[64 * 1024];
init(a, probability);
{
System.out.println("Warmup");
for (int i = 0; i < 10_000; i++)
{
test1(a);
}
System.out.println("Run");
long t0 = System.nanoTime();
for (int i = 0; i < 10_000; i++)
{
test1(a);
}
long t1 = System.nanoTime();
System.out.println("Time: " + format.format(t1 - t0));
}
{
System.out.println("Warmup");
for (int i = 0; i < 10_000; i++)
{
test2(a);
}
System.out.println("Run");
long t0 = System.nanoTime();
for (int i = 0; i < 10_000; i++)
{
test2(a);
}
long t1 = System.nanoTime();
System.out.println("Time: " + format.format(t1 - t0));
}
}
public static int test1(int[] a)
{
int x = Integer.MIN_VALUE;
for (int i = 0; i < a.length; i++)
{
x = Math.max(x, a[i]);
}
return x;
}
public static int test2(int[] a)
{
int x = Integer.MIN_VALUE;
for (int i = 0; i < a.length; i++)
{
x = (x >= a[i]) ? x : a[i];
}
return x;
}
public static void init(int[] ints, int probability)
{
int aboveCount, abovePercent;
do
{
int max = ThreadLocalRandom.current().nextInt(10);
ints[0] = max;
aboveCount = 0;
for (int i = 1; i < ints.length; i++)
{
int value;
if (ThreadLocalRandom.current().nextInt(101) <= probability)
{
int increment = ThreadLocalRandom.current().nextInt(10);
value = max + increment;
aboveCount++;
}
else
{
// Decrement by at least 1
int decrement = ThreadLocalRandom.current().nextInt(10) + 1;
value = max - decrement;
}
ints[i] = value;
max = Math.max(max, value);
}
abovePercent = ((aboveCount + 1) * 100) / ints.length;
} while (abovePercent != probability);
}
}
Focusing my comment below on 100% which is where the differences appear:
test2 (100%):
;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09
0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d
;*iaload {reexecute=0 rethrow=0 return_oop=0}
; - TestIntMax::test2 at 14 (line 71)
0x00007f15bcada2ee: cmpl %r11d, %r10d
0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
; - TestIntMax::test2 at 25 (line 71)
test1 (100%)
;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09
0x00007f15bcad9a70: movl 0x4c(%rsi, %rdx, 4), %r11d
0x00007f15bcad9a75: movl %r11d, (%rsp)
0x00007f15bcad9a79: movl 0x48(%rsi, %rdx, 4), %r10d
0x00007f15bcad9a7e: movl %r10d, 4(%rsp)
0x00007f15bcad9a83: movl 0x10(%rsi, %rdx, 4), %r11d
0x00007f15bcad9a88: movl 0x14(%rsi, %rdx, 4), %r9d
0x00007f15bcad9a8d: movl 0x44(%rsi, %rdx, 4), %r10d
0x00007f15bcad9a92: movl %r10d, 8(%rsp)
0x00007f15bcad9a97: movl 0x18(%rsi, %rdx, 4), %r8d
0x00007f15bcad9a9c: cmpl %r11d, %eax
0x00007f15bcad9a9f: cmovll %r11d, %eax
0x00007f15bcad9aa3: cmpl %r9d, %eax
0x00007f15bcad9aa6: cmovll %r9d, %eax
0x00007f15bcad9aaa: movl 0x20(%rsi, %rdx, 4), %r10d
0x00007f15bcad9aaf: cmpl %r8d, %eax
0x00007f15bcad9ab2: cmovll %r8d, %eax
0x00007f15bcad9ab6: movl 0x24(%rsi, %rdx, 4), %r8d
0x00007f15bcad9abb: movl 0x28(%rsi, %rdx, 4), %r11d
; {no_reloc}
0x00007f15bcad9ac0: movl 0x2c(%rsi, %rdx, 4), %ecx
0x00007f15bcad9ac4: movl 0x30(%rsi, %rdx, 4), %r9d
0x00007f15bcad9ac9: movl 0x34(%rsi, %rdx, 4), %edi
0x00007f15bcad9acd: movl 0x38(%rsi, %rdx, 4), %ebx
0x00007f15bcad9ad1: movl 0x3c(%rsi, %rdx, 4), %ebp
0x00007f15bcad9ad5: movl 0x40(%rsi, %rdx, 4), %r13d
0x00007f15bcad9ada: movl 0x1c(%rsi, %rdx, 4), %r14d
0x00007f15bcad9adf: cmpl %r14d, %eax
0x00007f15bcad9ae2: cmovll %r14d, %eax
0x00007f15bcad9ae6: cmpl %r10d, %eax
0x00007f15bcad9ae9: cmovll %r10d, %eax
0x00007f15bcad9aed: cmpl %r8d, %eax
0x00007f15bcad9af0: cmovll %r8d, %eax
0x00007f15bcad9af4: cmpl %r11d, %eax
0x00007f15bcad9af7: cmovll %r11d, %eax
0x00007f15bcad9afb: cmpl %ecx, %eax
0x00007f15bcad9afd: cmovll %ecx, %eax
0x00007f15bcad9b00: cmpl %r9d, %eax
0x00007f15bcad9b03: cmovll %r9d, %eax
0x00007f15bcad9b07: cmpl %edi, %eax
0x00007f15bcad9b09: cmovll %edi, %eax
0x00007f15bcad9b0c: cmpl %ebx, %eax
0x00007f15bcad9b0e: cmovll %ebx, %eax
0x00007f15bcad9b11: cmpl %ebp, %eax
0x00007f15bcad9b13: cmovll %ebp, %eax
0x00007f15bcad9b16: cmpl %r13d, %eax
0x00007f15bcad9b19: cmovll %r13d, %eax
0x00007f15bcad9b1d: cmpl 8(%rsp), %eax
0x00007f15bcad9b21: movl 8(%rsp), %r11d
0x00007f15bcad9b26: cmovll %r11d, %eax
0x00007f15bcad9b2a: cmpl 4(%rsp), %eax
0x00007f15bcad9b2e: movl 4(%rsp), %r10d
0x00007f15bcad9b33: cmovll %r10d, %eax
0x00007f15bcad9b37: cmpl (%rsp), %eax
0x00007f15bcad9b3a: movl (%rsp), %r11d
0x00007f15bcad9b3e: cmovll %r11d, %eax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
; - TestIntMax::test1 at 15 (line 61)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050
More information about the core-libs-dev
mailing list