RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Galder Zamarreño
galder at openjdk.org
Thu Feb 20 06:53:04 UTC 2025
On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the java implementation for these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
> * Remove platform dependant check
> and use platform independent configuration instead.
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ... and 34 more: https://git.openjdk.org/jdk/compare/af7645e5...a190ae68
To follow up https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851, I've run `MinMaxVector.int` benchmarks with **superword disabled** and with/without `_max`/`_min` intrinsics in both AVX-512 and AVX2 modes.
# AVX-512
Benchmark (probability) (range) (seed) (size) Mode Cnt -min/-max +min/+max Units
MinMaxVector.intClippingRange N/A 90 0 1000 thrpt 4 1067.050 1038.640 ops/ms
MinMaxVector.intClippingRange N/A 100 0 1000 thrpt 4 1041.922 1039.004 ops/ms
MinMaxVector.intLoopMax 50 N/A N/A 2048 thrpt 4 605.173 604.337 ops/ms
MinMaxVector.intLoopMax 80 N/A N/A 2048 thrpt 4 605.106 604.309 ops/ms
MinMaxVector.intLoopMax 100 N/A N/A 2048 thrpt 4 604.547 604.432 ops/ms
MinMaxVector.intLoopMin 50 N/A N/A 2048 thrpt 4 495.042 605.216 ops/ms (+22%)
MinMaxVector.intLoopMin 80 N/A N/A 2048 thrpt 4 495.105 495.217 ops/ms
MinMaxVector.intLoopMin 100 N/A N/A 2048 thrpt 4 495.040 495.176 ops/ms
MinMaxVector.intReductionMultiplyMax 50 N/A N/A 2048 thrpt 4 407.920 407.984 ops/ms
MinMaxVector.intReductionMultiplyMax 80 N/A N/A 2048 thrpt 4 407.710 407.965 ops/ms
MinMaxVector.intReductionMultiplyMax 100 N/A N/A 2048 thrpt 4 874.881 407.922 ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin 50 N/A N/A 2048 thrpt 4 407.911 407.947 ops/ms
MinMaxVector.intReductionMultiplyMin 80 N/A N/A 2048 thrpt 4 408.015 408.024 ops/ms
MinMaxVector.intReductionMultiplyMin 100 N/A N/A 2048 thrpt 4 407.978 407.994 ops/ms
MinMaxVector.intReductionSimpleMax 50 N/A N/A 2048 thrpt 4 460.538 460.439 ops/ms
MinMaxVector.intReductionSimpleMax 80 N/A N/A 2048 thrpt 4 460.579 460.542 ops/ms
MinMaxVector.intReductionSimpleMax 100 N/A N/A 2048 thrpt 4 998.211 460.404 ops/ms (-53%)
MinMaxVector.intReductionSimpleMin 50 N/A N/A 2048 thrpt 4 460.570 460.447 ops/ms
MinMaxVector.intReductionSimpleMin 80 N/A N/A 2048 thrpt 4 460.552 460.493 ops/ms
MinMaxVector.intReductionSimpleMin 100 N/A N/A 2048 thrpt 4 460.455 460.485 ops/ms
There is some improvement in `intLoopMin` @ 50%, but this didn't materialize in the `perfasm` run, so I don't think it can be strictly correlated with the use of the intrinsic.
The `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% regressions with the `max` intrinsic activated are consistent with what we saw with long.
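For reference, a hedged sketch of the loop shape where that @ 100% regression shows up (class and method names here are hypothetical stand-ins, not the actual JMH benchmark source). The idea is that with ascending data every iteration updates the accumulator, i.e. the comparison inside `Math.max` resolves the same way 100% of the time, so a well-predicted cmp+branch can beat an unconditional `cmov`:

```java
// Hypothetical sketch of an intReductionSimpleMax-style loop. With
// ascending input, Math.max's comparison always takes the same side
// (100% branch probability), the scenario where the regression appears.
public class MaxReductionSketch {
    static int intReductionSimpleMax(int[] a) {
        int result = Integer.MIN_VALUE;
        for (int v : a) {
            result = Math.max(result, v); // accumulator is replaced on every iteration
        }
        return result;
    }

    public static void main(String[] args) {
        // Ascending data: the "update the accumulator" branch wins every time.
        int[] ascending = new int[2048];
        for (int i = 0; i < ascending.length; i++) ascending[i] = i;
        System.out.println(intReductionSimpleMax(ascending)); // prints 2047
    }
}
```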
### `intReductionMultiplyMin` and `intReductionSimpleMin` @ 100% same performance
There is something very intriguing happening here, and I don't know whether it's due to the `min` operation itself or to int vs long. Basically, with or without the `min` intrinsic, the performance of these 2 benchmarks is the same at 100% branch probability. What is going on? Let's look at one of them:
-min
# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_max,_min -XX:-UseSuperWord
...
3.04% ││││ │ 0x00007f49280f76e9: cmpl %edi, %r10d
3.14% ││││ │ 0x00007f49280f76ec: cmovgl %edi, %r10d ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
││││ │ ; - java.lang.Math::min at 10 (line 2119)
││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)
+min
# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
...
3.10% ││ │ 0x00007fbf340f6b97: cmpl %edi, %r10d
3.08% ││ │ 0x00007fbf340f6b9a: cmovgl %edi, %r10d ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)
Both are `cmov`: without the intrinsic the `Math::min` bytecode gets compiled into a `cmov`, and with the intrinsic the result is the same.
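To make the shape concrete, here is a hedged sketch of a min reduction of this kind (names are hypothetical, not the actual `MinMaxVector` source): a plain `Math.min` accumulator loop, which per the perfasm output above C2 compiles to cmp+`cmovgl` in both the intrinsified and non-intrinsified variants.

```java
// Hypothetical sketch of an intReductionSimpleMin-style loop: a simple
// Math.min reduction that C2 compiles to cmp + cmov on x86, with or
// without the _min intrinsic, per the disassembly above.
public class MinReductionSketch {
    static int intReductionSimpleMin(int[] a) {
        int result = Integer.MAX_VALUE;
        for (int v : a) {
            result = Math.min(result, v); // compiled to cmpl + cmovgl
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 8, 1, 9};
        System.out.println(intReductionSimpleMin(data)); // prints 1
    }
}
```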
I will verify this with long shortly to see whether this behaviour is specific to the `min` operation or related to int vs long.
# AVX2
Here are the AVX2 numbers:
Benchmark (probability) (range) (seed) (size) Mode Cnt -min/-max +min/+max Units
MinMaxVector.intClippingRange N/A 90 0 1000 thrpt 4 1068.265 1039.087 ops/ms
MinMaxVector.intClippingRange N/A 100 0 1000 thrpt 4 1067.705 1038.760 ops/ms
MinMaxVector.intLoopMax 50 N/A N/A 2048 thrpt 4 605.015 604.364 ops/ms
MinMaxVector.intLoopMax 80 N/A N/A 2048 thrpt 4 605.169 604.366 ops/ms
MinMaxVector.intLoopMax 100 N/A N/A 2048 thrpt 4 604.527 604.494 ops/ms
MinMaxVector.intLoopMin 50 N/A N/A 2048 thrpt 4 605.099 605.057 ops/ms
MinMaxVector.intLoopMin 80 N/A N/A 2048 thrpt 4 495.071 605.080 ops/ms (+22%)
MinMaxVector.intLoopMin 100 N/A N/A 2048 thrpt 4 495.134 495.047 ops/ms
MinMaxVector.intReductionMultiplyMax 50 N/A N/A 2048 thrpt 4 407.953 407.987 ops/ms
MinMaxVector.intReductionMultiplyMax 80 N/A N/A 2048 thrpt 4 407.861 408.005 ops/ms
MinMaxVector.intReductionMultiplyMax 100 N/A N/A 2048 thrpt 4 873.915 407.995 ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin 50 N/A N/A 2048 thrpt 4 408.019 407.987 ops/ms
MinMaxVector.intReductionMultiplyMin 80 N/A N/A 2048 thrpt 4 407.971 408.009 ops/ms
MinMaxVector.intReductionMultiplyMin 100 N/A N/A 2048 thrpt 4 407.970 407.956 ops/ms
MinMaxVector.intReductionSimpleMax 50 N/A N/A 2048 thrpt 4 460.443 460.514 ops/ms
MinMaxVector.intReductionSimpleMax 80 N/A N/A 2048 thrpt 4 460.484 460.581 ops/ms
MinMaxVector.intReductionSimpleMax 100 N/A N/A 2048 thrpt 4 1015.601 460.446 ops/ms (-54%)
MinMaxVector.intReductionSimpleMin 50 N/A N/A 2048 thrpt 4 460.494 460.532 ops/ms
MinMaxVector.intReductionSimpleMin 80 N/A N/A 2048 thrpt 4 460.489 460.451 ops/ms
MinMaxVector.intReductionSimpleMin 100 N/A N/A 2048 thrpt 4 1021.420 460.435 ops/ms (-55%)
This time we see an improvement in `intLoopMin` @ 80%, but again it was not observable in the `perfasm` run.
`intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% have regressions, the familiar cmp+mov vs cmov difference. `intReductionMultiplyMin` @ 100% does not have a regression for the same reason as above: both variants use `cmov`.
The interesting thing is `intReductionSimpleMin` @ 100%. We see a regression there, but I didn't observe it in the `perfasm` run. So this could be due to run-to-run variance in whether `cmov` is applied?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670609470
More information about the core-libs-dev
mailing list