RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v4]
Galder Zamarreño
galder at openjdk.org
Thu Oct 17 10:15:25 UTC 2024
On Thu, 17 Oct 2024 10:10:56 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation for these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 30 additional commits since the last revision:
>
> - Use same default size as in other vector reduction benchmarks
> - Renamed benchmark class
> - Double/Float tests only when avx enabled
> - Make state class non-final
> - Restore previous benchmark iterations and default param size
> - Add clipping range benchmark that uses min/max
> - Encapsulate benchmark state within an inner class
> - Avoid creating result array in benchmark method
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Revert "Implement cmovL as a jump+mov branch"
>
> This reverts commit 1522e26bf66c47b780ebd0d0d0c4f78a4c564e44.
> - ... and 20 more: https://git.openjdk.org/jdk/compare/52005a12...0a8718e1
I've re-run the benchmarks in non-AVX512 and AVX512 environments, making sure no .ad changes were applied.
I've also added the clipping range benchmarks suggested by @theRealAph.
Remember that the AVX512 and non-AVX512 results were obtained on different systems, so they cannot be compared with each other. Only compare base against patch within the AVX512 results, and likewise within the non-AVX512 results.
The results for loop* and reduction* match the behaviour explained in https://github.com/openjdk/jdk/pull/20098#issuecomment-2379386872. The explanation in that comment applies here as well:
Benchmark                         (probability)  (range)  (seed)  (size)   Mode  Cnt    Score   Error  Units
MinMaxLoopBench.longReductionMax             50      N/A     N/A   10000  thrpt    8  107.441 ± 0.092  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax             80      N/A     N/A   10000  thrpt    8  107.431 ± 0.057  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax            100      N/A     N/A   10000  thrpt    8  213.200 ± 5.070  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax             50      N/A     N/A   10000  thrpt    8  107.411 ± 0.088  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax             80      N/A     N/A   10000  thrpt    8  107.425 ± 0.097  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax            100      N/A     N/A   10000  thrpt    8  107.377 ± 0.075  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax             50      N/A     N/A   10000  thrpt    8  414.214 ± 0.898  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax             80      N/A     N/A   10000  thrpt    8  414.637 ± 0.074  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax            100      N/A     N/A   10000  thrpt    8  239.570 ± 3.034  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax             50      N/A     N/A   10000  thrpt    8  414.276 ± 0.399  ops/ms (AVX512, patch)
MinMaxLoopBench.longReductionMax             80      N/A     N/A   10000  thrpt    8  414.284 ± 0.342  ops/ms (AVX512, patch)
MinMaxLoopBench.longReductionMax            100      N/A     N/A   10000  thrpt    8  413.860 ± 1.831  ops/ms (AVX512, patch)
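For readers without the benchmark source at hand, here is a minimal sketch of the kind of loop a long max reduction exercises, plus one possible way to build input where a given percentage of elements raises the running maximum. This is an illustration under assumptions, not the `MinMaxLoopBench` code: the class name, `reduceMax`, `increasingWithProbability`, and the reading of the `probability` parameter as "percentage of elements that exceed everything before them" are all hypothetical.

```java
import java.util.Random;

public class LongMaxReductionSketch {

    // Reduction over Math.max(long, long). Without an intrinsic, each call
    // inlines to a compare plus a branch, which is control flow inside the loop.
    public static long reduceMax(long[] values) {
        long result = Long.MIN_VALUE;
        for (int i = 0; i < values.length; i++) {
            result = Math.max(result, values[i]);
        }
        return result;
    }

    // Hypothetical input generator (an assumption, not the benchmark's own):
    // with a `probability` percent chance, the next element exceeds all
    // previous ones; otherwise it repeats the current maximum.
    public static long[] increasingWithProbability(int size, int probability, long seed) {
        Random random = new Random(seed);
        long[] values = new long[size];
        long currentMax = 0;
        for (int i = 0; i < size; i++) {
            if (random.nextInt(100) < probability) {
                currentMax++;
            }
            values[i] = currentMax;
        }
        return values;
    }

    public static void main(String[] args) {
        // With probability 100 every element raises the maximum, so the
        // input is 1, 2, ..., size and the reduction returns size.
        long[] values = increasingWithProbability(10_000, 100, 42L);
        System.out.println(reduceMax(values)); // prints 10000
    }
}
```

Under that assumed reading, a probability of 100 makes the comparison go the same way on every iteration, and 50 makes it go either way about half the time.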
The clipping range results show big improvements, with throughput roughly 5.6x to 5.9x higher with the patch in both environments:
Benchmark                          (probability)  (range)  (seed)  (size)   Mode  Cnt    Score   Error  Units
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8  108.503 ± 0.399  ops/ms (non-AVX512, base)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8  107.655 ± 1.759  ops/ms (non-AVX512, base)
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8  613.310 ± 1.140  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8  613.282 ± 0.744  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8   64.343 ± 0.396  ops/ms (AVX512, base)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8   61.323 ± 6.059  ops/ms (AVX512, base)
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8  359.525 ± 0.570  ops/ms (AVX512, patch)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8  360.284 ± 1.408  ops/ms (AVX512, patch)
The improvements in clipping range are due to vector instructions being used:
0.11% ││ 0x00007f5e000266c8: vpcmpgtq %ymm4, %ymm5, %ymm12
0.56% ││ 0x00007f5e000266cd: vblendvpd %ymm12, %ymm5, %ymm4, %ymm12
0.04% ││ 0x00007f5e000266d3: vpcmpgtq %ymm6, %ymm12, %ymm11
1.10% ││ 0x00007f5e000266d8: vblendvpd %ymm11, %ymm6, %ymm12, %ymm11
2.93% ││ 0x00007f5e000266de: vmovdqu %ymm11, 0xf0(%r9, %r10, 8)
││ ;*lastore {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange at 35 (line 211)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
Whereas without the changes it uses scalar instructions:
0.56% │↗ 0x00007f9e98025e83: cmpq %r8, %rdx
2.98% ╭ ││ 0x00007f9e98025e86: jle 0x7f9e98025e8b ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.Math::min at 3 (line 2132)
│ ││ ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange at 32 (line 211)
│ ││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
0.03% │ ││ 0x00007f9e98025e88: movq %r8, %rdx ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.Math::min at 11 (line 2132)
│ ││ ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange at 32 (line 211)
│ ││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
0.04% ↘ ││ 0x00007f9e98025e8b: movq %rdx, 0x28(%r13, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange at 35 (line 211)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
19.79% ││ 0x00007f9e98025e90: addl $4, %ecx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange at 36 (line 210)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
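For context, the two listings above are consistent with a loop that clamps each element with one max and one min before storing it: the vectorized version shows two compare/blend pairs followed by a vector store, and the scalar version shows `Math::min` inlined as a compare and branch just before the `lastore`. A minimal sketch of such a loop follows. It is not the actual `MinMaxLoopBench.longClippingRange` source; the class name, method name, and parameters are hypothetical.

```java
import java.util.Arrays;

public class LongClippingSketch {

    // Clamp every element of `input` into [low, high] and write it to `output`.
    // Each iteration performs one Math.max and one Math.min on longs, so the
    // loop body is branch-free only if both calls compile to MaxL/MinL nodes.
    public static void clip(long[] input, long[] output, long low, long high) {
        for (int i = 0; i < input.length; i++) {
            output[i] = Math.min(Math.max(input[i], low), high);
        }
    }

    public static void main(String[] args) {
        long[] input = {-5, 0, 42, 99, 150};
        long[] output = new long[input.length];
        clip(input, output, 0, 100);
        System.out.println(Arrays.toString(output)); // prints [0, 0, 42, 99, 100]
    }
}
```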
Finally, I've fixed the float/double IR tests by adding conditionals to make sure they only run when UseAVX > 0.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2419120069