RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Galder Zamarreño
galder at openjdk.org
Wed Feb 26 11:36:11 UTC 2025
On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the java implementation for these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
> * Remove platform dependant check
> and use platform independent configuration instead.
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ... and 34 more: https://git.openjdk.org/jdk/compare/abdd4f5e...a190ae68
> > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this.
>
> Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example?
The probabilities are fine.
I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`.
In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in.
In the slow versions you see this `PrintMethodData`:
static java.lang.Math::min(II)I
interpreter_invocation_count: 18171
invocation_counter: 18171
backedge_counter: 0
decompile_count: 0
mdo size: 328 bytes
0 iload_0
1 iload_1
2 if_icmpgt 9
0 bci: 2 BranchData taken(7732) displacement(56)
not taken(10180)
5 iload_0
6 goto 10
32 bci: 6 JumpData taken(10180) displacement(24)
9 iload_1
10 ireturn
org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
interpreter_invocation_count: 189
invocation_counter: 189
backedge_counter: 313344
decompile_count: 0
mdo size: 384 bytes
0 iconst_0
1 istore_2
2 iconst_0
3 istore_3
4 iload_3
5 aload_1
6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I>
9 if_icmpge 33
0 bci: 9 BranchData taken(58) displacement(72)
not taken(192512)
12 aload_1
13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I>
16 iload_3
17 iaload
18 istore #4
20 iload_2
21 fast_iload #4
23 invokestatic 32 <java/lang/Math.min(II)I>
32 bci: 23 CounterData count(192512)
26 istore_2
27 iinc #3 1
30 goto 4
48 bci: 30 JumpData taken(192512) displacement(-48)
33 iload_2
34 ireturn
The benchmark method calls Math.min `192_512` times, yet the method data shows only `18_171` invocations,
of which `7_732` are taken which is 42%.
So it gets compiled with a `cmov` and the benchmark will be slow because it will branch 100% one of the sides.
In the fast version, `PrintMethodData` looks like this:
static java.lang.Math::min(II)I
interpreter_invocation_count: 1575322
invocation_counter: 1575322
backedge_counter: 0
decompile_count: 0
mdo size: 368 bytes
0 iload_0
1 iload_1
2 if_icmpgt 9
0 bci: 2 BranchData taken(1418001) displacement(56)
not taken(157062)
5 iload_0
6 goto 10
32 bci: 6 JumpData taken(157062) displacement(24)
9 iload_1
10 ireturn
org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
interpreter_invocation_count: 858
invocation_counter: 858
backedge_counter: 1756214
decompile_count: 0
mdo size: 424 bytes
0 iconst_0
1 istore_2
2 iconst_0
3 istore_3
4 iload_3
5 aload_1
6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I>
9 if_icmpge 33
0 bci: 9 BranchData taken(733) displacement(72)
not taken(1637363)
12 aload_1
13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I>
16 iload_3
17 iaload
18 istore #4
20 iload_2
21 fast_iload #4
23 invokestatic 32 <java/lang/Math.min(II)I>
32 bci: 23 CounterData count(1637363)
26 istore_2
27 iinc #3 1
30 goto 4
48 bci: 30 JumpData taken(1637363) displacement(-48)
33 iload_2
34 ireturn
The benchmark method calls Math.min `1_637_363` times, and the method data shows `1_575_322` invocations,
of which `1_418_001` are taken which is 90%. So no cmov is introduced and the benchmark will be fast because it will branch 100% one of the sides.
A factor here might be my Xeon machine. I run the benchmark on a 4 core VM inside it, so given the limited resources compilation can take longer. I've noticed that it's easier to replicate this scenario there rather than my M1 laptop, which has 10 cores.
>> So, if those int scalar regressions were not a problem when int min/max intrinsic was added, I would expect the same to apply to long.
>
> Do you know when they were added? If that was a long time ago, we might not have noticed back then, but we might notice now.
I don't know when they were added.
> That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability.
>
> What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here.
I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2684701935
More information about the core-libs-dev
mailing list