RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Galder Zamarreño
galder at openjdk.org
Wed Feb 19 17:42:08 UTC 2025
On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation of these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
> * Remove platform dependant check
> and use platform independent configuration instead.
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ... and 34 more: https://git.openjdk.org/jdk/compare/75abfbc2...a190ae68
Following our discussion, I've run the `MinMaxVector.long` benchmarks with superword disabled, with and without the `_maxL` intrinsic, in both AVX-512 and AVX2 modes.
The first thing I've observed is that, without superword, the results with AVX-512 and AVX2 are identical, so I will just focus on the AVX-512 results below.
Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt     -maxL      +maxL  Units
MinMaxVector.longClippingRange                   N/A       90       0    1000  thrpt    4  1012.017  1011.8109  ops/ms
MinMaxVector.longClippingRange                   N/A      100       0    1000  thrpt    4  1012.113  1011.9530  ops/ms
MinMaxVector.longLoopMax                          50      N/A     N/A    2048  thrpt    4   463.946   473.9408  ops/ms
MinMaxVector.longLoopMax                          80      N/A     N/A    2048  thrpt    4   465.391   473.8063  ops/ms
MinMaxVector.longLoopMax                         100      N/A     N/A    2048  thrpt    4   510.992   471.6280  ops/ms (-8%)
MinMaxVector.longLoopMin                          50      N/A     N/A    2048  thrpt    4   496.036   495.3142  ops/ms
MinMaxVector.longLoopMin                          80      N/A     N/A    2048  thrpt    4   495.797   497.1214  ops/ms
MinMaxVector.longLoopMin                         100      N/A     N/A    2048  thrpt    4   495.302   495.1535  ops/ms
MinMaxVector.longReductionMultiplyMax             50      N/A     N/A    2048  thrpt    4   405.495   405.3936  ops/ms
MinMaxVector.longReductionMultiplyMax             80      N/A     N/A    2048  thrpt    4   405.342   405.4505  ops/ms
MinMaxVector.longReductionMultiplyMax            100      N/A     N/A    2048  thrpt    4   846.492   405.4779  ops/ms (-52%)
MinMaxVector.longReductionMultiplyMin             50      N/A     N/A    2048  thrpt    4   414.755   414.7036  ops/ms
MinMaxVector.longReductionMultiplyMin             80      N/A     N/A    2048  thrpt    4   414.705   414.7093  ops/ms
MinMaxVector.longReductionMultiplyMin            100      N/A     N/A    2048  thrpt    4   414.761   414.7150  ops/ms
MinMaxVector.longReductionSimpleMax               50      N/A     N/A    2048  thrpt    4   460.435   460.3764  ops/ms
MinMaxVector.longReductionSimpleMax               80      N/A     N/A    2048  thrpt    4   460.438   460.4718  ops/ms
MinMaxVector.longReductionSimpleMax              100      N/A     N/A    2048  thrpt    4  1023.005   460.5417  ops/ms (-55%)
MinMaxVector.longReductionSimpleMin               50      N/A     N/A    2048  thrpt    4   459.184   459.1662  ops/ms
MinMaxVector.longReductionSimpleMin               80      N/A     N/A    2048  thrpt    4   459.265   459.2588  ops/ms
MinMaxVector.longReductionSimpleMin              100      N/A     N/A    2048  thrpt    4   459.263   459.1304  ops/ms
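The percentages annotated on the three 100% rows are the relative change of the score with the intrinsic against the score without it. A minimal sketch of that arithmetic, with the scores copied from the table; the class and method names are illustrative only and are not part of the benchmark:

```java
import java.util.Locale;

final class RegressionPercent {
    // Relative change of the +maxL score against the -maxL baseline, in percent.
    // Negative values mean throughput dropped with the intrinsic enabled.
    static double percentChange(double withoutIntrinsic, double withIntrinsic) {
        return (withIntrinsic - withoutIntrinsic) / withoutIntrinsic * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms, taken from the 100% probability rows of the table.
        String[] names = {
            "longLoopMax", "longReductionMultiplyMax", "longReductionSimpleMax"
        };
        double[] withoutMaxL = {510.992, 846.492, 1023.005};
        double[] withMaxL = {471.6280, 405.4779, 460.5417};

        for (int i = 0; i < names.length; i++) {
            double change = percentChange(withoutMaxL[i], withMaxL[i]);
            System.out.printf(Locale.ROOT, "%s: %.1f%%%n", names[i], change);
        }
    }
}
```

Rounded to whole percents, the three values match the annotations in the table.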
`longLoopMax`, `longReductionMultiplyMax` and `longReductionSimpleMax` at 100% are regressions with the `_maxL` intrinsic. The cause is familiar: without the intrinsic, cmp+mov are emitted, while with the intrinsic and the conditions above, `cmov` is emitted:
# `longLoopMax` @ 100%
-maxL:
4.18% ││││ │││ │ 0x00007fb7580f84b2: cmpq %r13, %r11
││││╭ │││ │ 0x00007fb7580f84b5: jl 0x7fb7580f84ec ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││ │││ │ ; - java.lang.Math::max@11 (line 2038)
│││││ │││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@27 (line 256)
│││││ │││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)
4.23% │││││ │││↗ │ 0x00007fb7580f84bb: movq %r11, 0x10(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
│││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@30 (line 256)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)
+maxL:
1.06% │││ 0x00007fe1b40f5ed1: movq 0x20(%rbx, %r10, 8), %r14;*laload {reexecute=0 rethrow=0 return_oop=0}
│││ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@26 (line 256)
│││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)
1.34% │││ 0x00007fe1b40f5ed6: cmpq %r14, %r9
2.78% │││ 0x00007fe1b40f5ed9: cmovlq %r14, %r9
2.58% │││ 0x00007fe1b40f5edd: movq %r9, 0x20(%rax, %r10, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
│││ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@30 (line 256)
│││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)
# `longReductionMultiplyMax` @ 100%
-maxL:
6.71% ││ ││↗ 0x00007f8af40f6278: imulq $0xb, 0x18(%r14, %r8, 8), %rdx
││ │││ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││ │││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@24 (line 285)
││ │││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)
5.28% ││ │││ 0x00007f8af40f627e: nop
10.23% ││ │││ 0x00007f8af40f6280: cmpq %rdx, %rdi
││╭ │││ 0x00007f8af40f6283: jge 0x7f8af40f62a7 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││ │││ ; - java.lang.Math::max@11 (line 2038)
│││ │││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@30 (line 286)
│││ │││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)
+maxL:
11.07% ││ 0x00007f47000f5c4d: imulq $0xb, 0x18(%r14, %r11, 8), %rax
││ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@24 (line 285)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)
0.07% ││ 0x00007f47000f5c53: cmpq %rdx, %rax
11.87% ││ 0x00007f47000f5c56: cmovlq %rdx, %rax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@30 (line 286)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)
# `longReductionSimpleMax` @ 100%
-maxL:
5.71% │││││ │││↗ │ 0x00007fc2380f75f9: movq 0x20(%r14, %r8, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
│││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@20 (line 295)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
1.85% │││││ ││││ │ 0x00007fc2380f75fe: nop
4.52% │││││ ││││ │ 0x00007fc2380f7600: cmpq %rdi, %rdx
│││││╭ ││││ │ 0x00007fc2380f7603: jge 0x7fc2380f7667 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││││││ ││││ │ ; - java.lang.Math::max@11 (line 2038)
││││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@26 (line 296)
││││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
+maxL:
3.06% ││││││ 0x00007fa6d00f6020: movq 0x70(%r14, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
││││││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@20 (line 295)
││││││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
││││││ 0x00007fa6d00f6025: cmpq %r8, %r13
2.88% ││││││ 0x00007fa6d00f6028: cmovlq %r8, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││││││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@26 (line 296)
││││││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
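For readers who want to look at the two code shapes at the source level, here is a minimal sketch of a max reduction written both ways. It is not the `MinMaxVector` benchmark code: the class name, method names and the ascending input are hypothetical, and the input assumes that "100%" means the comparison resolves the same way on every iteration. The Java source form does not dictate whether C2 emits a branch or a `cmov`; the listings above are the authority on what was actually generated.

```java
final class MaxShapes {
    // Same shape as the Java-level body of Math.max(long, long) quoted earlier.
    static long maxTernary(long a, long b) {
        return (a >= b) ? a : b;
    }

    // Reduction through the plain ternary; with the intrinsic disabled,
    // Math.max is compiled from an equivalent body.
    static long reduceWithTernary(long[] data) {
        long result = Long.MIN_VALUE;
        for (long value : data) {
            result = maxTernary(result, value);
        }
        return result;
    }

    // Reduction through Math.max, the call the _maxL intrinsic targets.
    static long reduceWithMathMax(long[] data) {
        long result = Long.MIN_VALUE;
        for (long value : data) {
            result = Math.max(result, value);
        }
        return result;
    }

    // Hypothetical stand-in for a fully predictable input: strictly ascending
    // values, so every comparison in the reduction has the same outcome.
    static long[] ascending(int size) {
        long[] data = new long[size];
        for (int i = 0; i < size; i++) {
            data[i] = i;
        }
        return data;
    }

    public static void main(String[] args) {
        long[] data = ascending(2048);
        System.out.println(reduceWithTernary(data));
        System.out.println(reduceWithMathMax(data));
    }
}
```

Both reductions return the same value; only the generated code can differ. One way to compare them is to run with `-XX:-UseSuperWord` and, under `-XX:+UnlockDiagnosticVMOptions`, with and without `-XX:DisableIntrinsic=_maxL`. Whether these are the exact flags behind the numbers above is an assumption.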
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669329851