RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
Galder Zamarreño
galder at openjdk.org
Fri Feb 7 12:31:11 UTC 2025
On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation of these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, the auto-vectorizer no longer finds control flow in the loop and can vectorize it.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on macro expansion to transform those nodes into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix typo
@eastig is helping with the results on aarch64, so I will verify those numbers in the same way as done below for x86_64 once he provides me with the results.
Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly).
First I will go through the results of `MinMaxVector`. This benchmark reports throughput by default, so the higher the number the better.
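For context, the kernels exercised by this benchmark have roughly the following shape. This is a simplified sketch reconstructed from the bytecode and line references in the disassembly below, not the exact benchmark source; names are illustrative. The `probability` parameter controls how often one side of the comparison in `Math.max`/`Math.min` wins.

// Rough sketch of the MinMaxVector kernels (reconstructed, illustrative only).
static long longReductionMax(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i] * 11);   // the multiply feeding the max shows up as imulq $0xb below
    }
    return acc;
}

static void longLoopMax(long[] a, long[] b, long[] result) {
    for (int i = 0; i < a.length; i++) {
        result[i] = Math.max(a[i], b[i]); // element-wise max stored back to an array
    }
}

static void longClippingRange(long[] a, long[] result, long lower, long upper) {
    for (int i = 0; i < a.length; i++) {
        result[i] = Math.min(Math.max(a[i], lower), upper); // clamp into [lower, upper]
    }
}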
# `MinMaxVector` AVX-512
Following are results with AVX-512 instructions:
Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units
MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms
MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms
MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms
MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms
MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms
MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms
MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms
MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms
MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms
MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms
MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms
MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms
MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms
MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms
### `longReduction[Min|Max]` performance improves slightly when probability is 100
Without the patch the code uses compare + branch instructions:
7.83% ││││ │││↗ │ 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi
││││ ││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
5.64% ││││ ││││ │ 0x00007f4f700fb30b: cmpq %rdi, %rdx
││││╭││││ │ 0x00007f4f700fb30e: jge 0x7f4f700fb32c ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││││││ │ ; - java.lang.Math::max at 11 (line 2037)
│││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
12.82% │││││││││↗ │ 0x00007f4f700fb310: imulq $0xb, 0x28(%r14, %r8, 8), %rbp
││││││││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
7.46% ││││││││││ │ 0x00007f4f700fb316: cmpq %rbp, %rdi
│││││╰││││ │ 0x00007f4f700fb319: jl 0x7f4f700fb2e0 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│││││ ││││ │ ; - java.lang.Math::max at 3 (line 2037)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
And with the patch these become vectorized:
│ ││ ↗││││ 0x00007f56280fad10: vpmullq 0xf0(%rdx, %rsi, 8), %ymm10, %ymm4
8.35% │ ││ │││││ 0x00007f56280fad1b: vpmullq 0xd0(%rdx, %rsi, 8), %ymm10, %ymm5
4.27% │ ││ │││││ 0x00007f56280fad26: vpmullq 0x10(%rdx, %rsi, 8), %ymm10, %ymm6
│ ││ │││││ ; {no_reloc}
4.22% │ ││ │││││ 0x00007f56280fad31: vpmullq 0x30(%rdx, %rsi, 8), %ymm10, %ymm7
4.00% │ ││ │││││ 0x00007f56280fad3c: vpmullq 0xb0(%rdx, %rsi, 8), %ymm10, %ymm8
4.13% │ ││ │││││ 0x00007f56280fad47: vpmullq 0x50(%rdx, %rsi, 8), %ymm10, %ymm11
4.10% │ ││ │││││ 0x00007f56280fad52: vpmullq 0x70(%rdx, %rsi, 8), %ymm10, %ymm12
4.13% │ ││ │││││ 0x00007f56280fad5d: vpmullq 0x90(%rdx, %rsi, 8), %ymm10, %ymm13
4.03% │ ││ │││││ 0x00007f56280fad68: vpmaxsq %ymm6, %ymm3, %ymm3
│ ││ │││││ 0x00007f56280fad6e: vpmaxsq %ymm7, %ymm3, %ymm3
4.72% │ ││ │││││ 0x00007f56280fad74: vpmaxsq %ymm11, %ymm3, %ymm3
│ ││ │││││ 0x00007f56280fad7a: vpmaxsq %ymm12, %ymm3, %ymm3
8.40% │ ││ │││││ 0x00007f56280fad80: vpmaxsq %ymm13, %ymm3, %ymm3
23.11% │ ││ │││││ 0x00007f56280fad86: vpmaxsq %ymm8, %ymm3, %ymm3
2.15% │ ││ │││││ 0x00007f56280fad8c: vpmaxsq %ymm5, %ymm3, %ymm3
8.79% │ ││ │││││ 0x00007f56280fad92: vpmaxsq %ymm4, %ymm3, %ymm3 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
│ ││ │││││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│ ││ │││││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
### `longLoop[Min|Max]` performance improves considerably when probability is 100
Without the patch the code uses compare + move instructions:
4.53% ││││ ││ │ │ 0x00007f96b40faf33: movq 0x18(%rax, %rsi, 8), %r13;*laload {reexecute=0 rethrow=0 return_oop=0}
││││ ││ │ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 20 (line 236)
││││ ││ │ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
2.69% ││││ ││ │ │ 0x00007f96b40faf38: cmpq %r11, %r13
││││╭ ││ │ │ 0x00007f96b40faf3b: jl 0x7f96b40faf67 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││ ││ │ │ ; - java.lang.Math::max at 11 (line 2037)
│││││ ││ │ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 236)
│││││ ││ │ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
8.75% │││││ ││↗ │ │ 0x00007f96b40faf3d: movq %r13, 0x18(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
│││││ │││ │ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236)
│││││ │││ │ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
And with the patch those become vectorized:
3.55% │ ││ 0x00007f13c80fa18a: vmovdqu 0xf0(%rbx, %r10, 8), %ymm5
│ ││ 0x00007f13c80fa194: vmovdqu 0xf0(%rdi, %r10, 8), %ymm6
2.35% │ ││ 0x00007f13c80fa19e: vpmaxsq %ymm6, %ymm5, %ymm5
5.03% │ ││ 0x00007f13c80fa1a4: vmovdqu %ymm5, 0xf0(%rax, %r10, 8)
│ ││ ;*lastore {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236)
│ ││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
It's interesting to observe that at probabilities of 50/80% the baseline performs better than at 100%. The reason is that at 50/80% the baseline already vectorizes. So, why isn't the baseline vectorizing at 100% probability?
VLoop::check_preconditions
Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined
1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
VLoop::check_preconditions: fails because of control flow.
cl_exit 594 594 CountedLoopEnd === 415 593 [[ 1275 463 ]] [lt] P=0.999684, C=707717.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
cl_exit->in(0) 415 415 Region === 415 411 412 [[ 415 594 416 451 ]] !orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
lpt->_head 1256 1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined
VLoop::check_preconditions: failed: control flow in loop not allowed
At 100% probability the baseline fails to vectorize because it observes control flow in the loop. This control flow is not the one you see in the min/max Java implementation; it is added by HotSpot as a result of JIT profiling: it observes that one branch is always taken, optimizes for that case, and adds a branch for the uncommon case where the branch is not taken.
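To illustrate the shape (a hand-written approximation, not actual C2 output): with a balanced profile the ternary in `Math.max` stays a simple diamond that the compiler can collapse into a conditional move (or, with this patch, a MaxL/MinL node), leaving no control flow in the loop. With a ~100% profile the never-taken side is compiled as an uncommon path, modeled below as an exception purely to show the extra branch that remains inside the loop.

static long maxBalancedProfile(long acc, long v) {
    return (acc >= v) ? acc : v;   // diamond collapses: no control flow left in the loop body
}

static long maxSkewedProfile(long acc, long v) {
    if (acc < v) {
        // uncommon path: in the generated code this is a deoptimization/trap edge
        throw new AssertionError("uncommon trap placeholder");
    }
    return acc;                    // hot path the profile says is ~always taken
}
// The surviving in-loop branch is what VLoop::check_preconditions rejects.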
### `longClippingRange` performance improves considerably
Without the patch the code uses compare + move instructions:
3.39% ││ │ ││ │ 0x00007febb40fb175: cmpq %rbp, %rcx
││ │╭ ││ │ 0x00007febb40fb178: jge 0x7febb40fb17d ;*iflt {reexecute=0 rethrow=0 return_oop=0}
││ ││ ││ │ ; - java.lang.Math::max at 3 (line 2037)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
2.69% ││ ││ ││ │ 0x00007febb40fb17a: movq %rbp, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││ ││ ││ │ ; - java.lang.Math::max at 11 (line 2037)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
4.35% ││ │↘ ││ │ 0x00007febb40fb17d: nop
2.93% ││ │ ││ │ 0x00007febb40fb180: cmpq %r8, %rcx
││ │ ╭ ││ │ 0x00007febb40fb183: jle 0x7febb40fb188 ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
││ │ │ ││ │ ; - java.lang.Math::min at 3 (line 2132)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
3.51% ││ │ │ ││ │ 0x00007febb40fb185: movq %r8, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││ │ │ ││ │ ; - java.lang.Math::min at 11 (line 2132)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
4.26% ││ │ ↘ ││ │ 0x00007febb40fb188: movq %rcx, 0x10(%rsi, %r9, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
││ │ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220)
││ │ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
With the patch these become vectorized:
0.20% ││↗ ↗ 0x00007f10180fd15c: vmovdqu 0x10(%r11, %rcx, 8), %ymm6
│││ │ 0x00007f10180fd163: vpmaxsq %ymm6, %ymm7, %ymm6
│││ │ 0x00007f10180fd169: vpminsq %ymm8, %ymm6, %ymm6
│││ │ 0x00007f10180fd16f: vmovdqu %ymm6, 0x10(%r8, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
│││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220)
│││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
# `MinMaxVector` AVX2
Following are results on the same machine as above but forcing AVX2 to be used instead of AVX-512:
Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units
MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 832.132 1813.609 ops/ms
MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 832.546 1814.477 ops/ms
MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 938.372 939.313 ops/ms
MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 934.964 945.124 ops/ms
MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 512.076 937.287 ops/ms
MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 999.455 689.750 ops/ms
MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1000.352 876.326 ops/ms
MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.359 999.475 ops/ms
MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 409.413 409.363 ops/ms
MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 409.374 409.141 ops/ms
MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 883.614 409.318 ops/ms
MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 404.723 404.705 ops/ms
MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 404.755 404.748 ops/ms
MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 848.784 404.669 ops/ms
### `longClippingRange` performance improves considerably
Baseline uses compare + move instructions as shown above. But the patched version improves in spite of not being able to use AVX-512 instructions such as `vpmaxsq`. The performance improvement comes from a vectorized compare followed by a vectorized blend (a scalar sketch of this pattern follows the listing):
│ │ ││││ 0x00007f9aa40f94ac: vpcmpgtq %ymm6, %ymm7, %ymm12
3.79% │ │ ││││ 0x00007f9aa40f94b1: vblendvpd %ymm12, %ymm7, %ymm6, %ymm12
3.72% │ │ ││││ 0x00007f9aa40f94b7: vpcmpgtq %ymm8, %ymm12, %ymm10
│ │ ││││ 0x00007f9aa40f94bc: vblendvpd %ymm10, %ymm8, %ymm12, %ymm10
3.78% │ │ ││││ 0x00007f9aa40f94c2: vmovdqu %ymm10, 0xf0(%r8, %rcx, 8)
│ │ ││││ ;*lastore {reexecute=0 rethrow=0 return_oop=0}
│ │ ││││ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220)
│ │ ││││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
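The scalar equivalent of this per-lane pattern is a branch-free select: `vpcmpgtq` produces an all-ones/all-zeros mask per lane and `vblendvpd` picks one of the two inputs based on that mask. A minimal scalar sketch of the same idea (illustrative only):

static long maxViaCompareAndBlend(long a, long b) {
    long mask = (a > b) ? -1L : 0L;   // vpcmpgtq: all-ones where a > b, all-zeros otherwise
    return (a & mask) | (b & ~mask);  // vblendvpd: select a where the mask is set, else b
}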
### `longReduction[Min|Max]` performance drops considerably when probability is 100
Baseline uses compare + branch instructions to implement this:
││││ ││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
6.30% ││││ ││││ │ 0x00007fd5580f678b: cmpq %rdi, %rdx
││││╭││││ │ 0x00007fd5580f678e: jge 0x7fd5580f67ac ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││││││ │ ; - java.lang.Math::max at 11 (line 2037)
│││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
12.88% │││││││││↗ │ 0x00007fd5580f6790: imulq $0xb, 0x28(%r14, %r8, 8), %rbp
││││││││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
7.55% ││││││││││ │ 0x00007fd5580f6796: cmpq %rbp, %rdi
│││││╰││││ │ 0x00007fd5580f6799: jl 0x7fd5580f6760 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│││││ ││││ │ ; - java.lang.Math::max at 3 (line 2037)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
With the patch the code uses conditional moves instead:
0.05% ↗│ 0x00007fc4700f5253: imulq $0xb, 0x28(%r14, %r11, 8), %rdx
10.62% ││ 0x00007fc4700f5259: imulq $0xb, 0x20(%r14, %r11, 8), %rax
0.63% ││ 0x00007fc4700f525f: imulq $0xb, 0x10(%r14, %r11, 8), %r8
││ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
10.34% ││ 0x00007fc4700f5265: cmpq %r8, %r13
2.37% ││ 0x00007fc4700f5268: cmovlq %r8, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
1.15% ││ 0x00007fc4700f526c: imulq $0xb, 0x18(%r14, %r11, 8), %r8
││ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
9.28% ││ 0x00007fc4700f5272: cmpq %r8, %r13
3.82% ││ 0x00007fc4700f5275: cmovlq %r8, %r13
21.61% ││ 0x00007fc4700f5279: cmpq %rax, %r13
11.55% ││ 0x00007fc4700f527c: cmovlq %rax, %r13
4.48% ││ 0x00007fc4700f5280: cmpq %rdx, %r13
11.76% ││ 0x00007fc4700f5283: cmovlq %rdx, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
When one of the branches is always or almost always taken, the branching code in the baseline benefits from branch prediction. The conditional move instructions, however, force the CPU to compute both inputs and wait for the comparison, so the patched code performs worse in this scenario.
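Sketched in scalar form (illustrative, not the generated code), the reduction carries `acc` through every conditional move, so each iteration has to wait for the previous one:

static long reductionMaxCmovShape(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        long v = a[i] * 11;          // the imulq $0xb in the listing above
        acc = (acc < v) ? v : acc;   // cmpq + cmovlq: acc of iteration i+1 depends on iteration i
    }
    return acc;
}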
Why are vectorized instructions not used in this scenario? Vector instructions for long min/max are not available with AVX2, and the vectorization trace signals it:
PackSet::print: 3 packs
Pack: 0
0: 1119 LoadL === 1105 343 1120 [[ 1117 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=997,663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1112 LoadL === 1105 343 1113 [[ 1111 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 997 LoadL === 1105 343 998 [[ 996 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 663 LoadL === 1105 343 455 [[ 458 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
Pack: 1
0: 1117 MulL === _ 1119 162 [[ 1116 ]] !orig=996,458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1111 MulL === _ 1112 162 [[ 1110 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 996 MulL === _ 997 162 [[ 995 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 458 MulL === _ 663 162 [[ 459 ]] !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
Pack: 2
0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
WARNING: Removed pack: not implemented at any smaller size:
0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
After SuperWord::split_packs_only_implemented_with_smaller_size
One interesting option to explore here would be whether MaxL/MinL could be implemented in terms of vectorized compare instructions, as shown above in the `longClippingRange` scenario. Thoughts @rwestrel @eme64?
# `VectorReduction2.WithSuperword` on AVX-512 machine
As requested by Emanuel, I've also run this benchmark. Note that the results here are average time per operation, so the lower the number the better:
Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.WithSuperword.longMaxBig 2048 0 avgt 3 3970.527 1918.821 ns/op
VectorReduction2.WithSuperword.longMaxDotProduct 2048 0 avgt 3 1369.634 1055.762 ns/op
VectorReduction2.WithSuperword.longMaxSimple 2048 0 avgt 3 722.314 2172.064 ns/op
VectorReduction2.WithSuperword.longMinBig 2048 0 avgt 3 3996.694 1918.398 ns/op
VectorReduction2.WithSuperword.longMinDotProduct 2048 0 avgt 3 1363.687 1056.375 ns/op
VectorReduction2.WithSuperword.longMinSimple 2048 0 avgt 3 718.150 2179.478 ns/op
The `long[Min|Max]Big` and `long[Min|Max]DotProduct` benchmarks show considerable improvements, but something odd is happening in `long[Min|Max]Simple`.
### `long[Min|Max]Simple` performance drops considerably
Baseline uses compare + branch instructions:
8.05% ││ ││↗ │ 0x00007f9d580f569b: movq 0x18(%r13, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
││ │││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
││ │││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
0.23% ││ │││ │ 0x00007f9d580f56a0: cmpq %r8, %rsi
││╭ │││ │ 0x00007f9d580f56a3: jl 0x7f9d580f5713 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││ │││ │ ; - java.lang.Math::max at 11 (line 2037)
│││ │││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
│││ │││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
Patched version uses conditional moves instead of vectorized instructions:
2.76% ││ 0x00007fcd180f695c: movq 0x18(%r14, %r11, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
││ 0x00007fcd180f6961: cmpq %rdi, %r13
3.11% ││ 0x00007fcd180f6964: cmovlq %rdi, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
Why are vectorized instructions not kicking in with the patch? Because SuperWord does not consider it profitable to vectorize this:
PackSet::print: 2 packs
Pack: 0
0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
Pack: 1
0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
WARNING: Removed pack: not profitable:
0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
WARNING: Removed pack: not profitable:
0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
After Superword::filter_packs_for_profitable
PackSet::print: 0 packs
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize
How can you make it vectorize? By doing some additional work on the array value before passing it to min/max. That is what the `MinMaxVector.longReduction[Min|Max]` and `VectorReduction2.long[Min|Max]DotProduct` methods do, as sketched below.
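Roughly, sketched from the line references in the disassembly (not the exact benchmark source):

// longMaxSimple shape: max over the raw array values; SuperWord drops the packs
// as not profitable, so the loop stays scalar and falls back to conditional moves.
static long maxSimple(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i]);
    }
    return acc;
}

// longMaxDotProduct shape: the extra multiply feeding the max makes vectorization
// profitable, so the loop vectorizes with the patch.
static long maxDotProduct(long[] a, long[] b) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i] * b[i]);
    }
    return acc;
}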
# `VectorReduction2.NoSuperword` on AVX-512 machine
Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.NoSuperword.longMaxBig 2048 0 avgt 3 3964.403 2966.258 ns/op
VectorReduction2.NoSuperword.longMaxDotProduct 2048 0 avgt 3 1686.373 2462.876 ns/op
VectorReduction2.NoSuperword.longMaxSimple 2048 0 avgt 3 722.219 2171.859 ns/op
VectorReduction2.NoSuperword.longMinBig 2048 0 avgt 3 3994.685 2971.143 ns/op
VectorReduction2.NoSuperword.longMinDotProduct 2048 0 avgt 3 1366.291 2428.173 ns/op
VectorReduction2.NoSuperword.longMinSimple 2048 0 avgt 3 719.218 2179.546 ns/op
Performance improves for `long[Min|Max]Big`. `long[Min|Max]Simple` suffers from the same issue as shown in the previous section: when not vectorized, these benchmarks fall back on conditional moves. The drop in performance in `long[Min|Max]DotProduct` needs some explanation.
### `long[Min|Max]DotProduct` performance drops considerably
Baseline uses compare + branch instructions here:
5.67% │││ │││↗ │ 0x00007f3fcc0fa71d: movq 0x20(%r14, %r8, 8), %r9
5.19% │││ ││││ │ 0x00007f3fcc0fa722: imulq 0x20(%rax, %r8, 8), %r9;*lmul {reexecute=0 rethrow=0 return_oop=0}
│││ ││││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125)
│││ ││││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
8.46% │││ ││││ │ 0x00007f3fcc0fa728: cmpq %r9, %rsi
│││╭││││ │ 0x00007f3fcc0fa72b: jl 0x7f3fcc0fa751 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││││││││ │ ; - java.lang.Math::max at 11 (line 2037)
││││││││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126)
││││││││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
Patch transforms this into conditional moves:
11.00% │ 0x00007f66f40f70b2: movq 0x18(%r13, %rcx, 8), %rax
│ 0x00007f66f40f70b7: imulq 0x18(%r9, %rcx, 8), %rax;*lmul {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
│ 0x00007f66f40f70bd: cmpq %rdx, %rax
13.07% │ 0x00007f66f40f70c0: cmovlq %rdx, %rax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
This is similar to what we have seen above. Without SuperWord, the fallback for MaxL/MinL is conditional moves. Although branch probabilities are not controlled here, we can observe that one of the branches is likely taken ~100% of the time.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2642788364