RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
Galder Zamarreño
galder at openjdk.org
Fri Feb 7 12:31:11 UTC 2025
On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation of these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, the auto-vectorizer no longer finds control flow in the loop and can vectorize it.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on macro expansion to transform those nodes into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix typo
@eastig is helping with the results on aarch64, so I will verify those numbers in the same way as done below for x86_64 once he provides me with the results.
Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly).
First I will go through the results of `MinMaxVector`. This benchmark reports throughput by default, so the higher the number the better.
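For context, the kernels exercised by this benchmark have roughly the following shape. This is a simplified sketch reconstructed from the bytecode and line references in the disassembly below, not the exact benchmark source; names are illustrative. The `probability` parameter controls how often one side of the comparison in `Math.max`/`Math.min` wins.

// Rough sketch of the MinMaxVector kernels (reconstructed, illustrative only).
static long longReductionMax(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i] * 11);   // the multiply feeding the max shows up as imulq $0xb below
    }
    return acc;
}

static void longLoopMax(long[] a, long[] b, long[] result) {
    for (int i = 0; i < a.length; i++) {
        result[i] = Math.max(a[i], b[i]); // element-wise max stored back to an array
    }
}

static void longClippingRange(long[] a, long[] result, long lower, long upper) {
    for (int i = 0; i < a.length; i++) {
        result[i] = Math.min(Math.max(a[i], lower), upper); // clamp into [lower, upper]
    }
}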
# `MinMaxVector` AVX-512
Following are results with AVX-512 instructions:
Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units
MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms
MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms
MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms
MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms
MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms
MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms
MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms
MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms
MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms
MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms
MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms
MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms
MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms
MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms
### `longReduction[Min|Max]` performance improves slightly when probability is 100
Without the patch the code uses compare + branch instructions:
7.83% ││││ │││↗ │ 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi
││││ ││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
5.64% ││││ ││││ │ 0x00007f4f700fb30b: cmpq %rdi, %rdx
││││╭││││ │ 0x00007f4f700fb30e: jge 0x7f4f700fb32c ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││││││ │ ; - java.lang.Math::max at 11 (line 2037)
│││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
12.82% │││││││││↗ │ 0x00007f4f700fb310: imulq $0xb, 0x28(%r14, %r8, 8), %rbp
││││││││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
7.46% ││││││││││ │ 0x00007f4f700fb316: cmpq %rbp, %rdi
│││││╰││││ │ 0x00007f4f700fb319: jl 0x7f4f700fb2e0 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│││││ ││││ │ ; - java.lang.Math::max at 3 (line 2037)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
And with the patch these become vectorized:
│ ││ ↗││││ 0x00007f56280fad10: vpmullq 0xf0(%rdx, %rsi, 8), %ymm10, %ymm4
8.35% │ ││ │││││ 0x00007f56280fad1b: vpmullq 0xd0(%rdx, %rsi, 8), %ymm10, %ymm5
4.27% │ ││ │││││ 0x00007f56280fad26: vpmullq 0x10(%rdx, %rsi, 8), %ymm10, %ymm6
│ ││ │││││ ; {no_reloc}
4.22% │ ││ │││││ 0x00007f56280fad31: vpmullq 0x30(%rdx, %rsi, 8), %ymm10, %ymm7
4.00% │ ││ │││││ 0x00007f56280fad3c: vpmullq 0xb0(%rdx, %rsi, 8), %ymm10, %ymm8
4.13% │ ││ │││││ 0x00007f56280fad47: vpmullq 0x50(%rdx, %rsi, 8), %ymm10, %ymm11
4.10% │ ││ │││││ 0x00007f56280fad52: vpmullq 0x70(%rdx, %rsi, 8), %ymm10, %ymm12
4.13% │ ││ │││││ 0x00007f56280fad5d: vpmullq 0x90(%rdx, %rsi, 8), %ymm10, %ymm13
4.03% │ ││ │││││ 0x00007f56280fad68: vpmaxsq %ymm6, %ymm3, %ymm3
│ ││ │││││ 0x00007f56280fad6e: vpmaxsq %ymm7, %ymm3, %ymm3
4.72% │ ││ │││││ 0x00007f56280fad74: vpmaxsq %ymm11, %ymm3, %ymm3
│ ││ │││││ 0x00007f56280fad7a: vpmaxsq %ymm12, %ymm3, %ymm3
8.40% │ ││ │││││ 0x00007f56280fad80: vpmaxsq %ymm13, %ymm3, %ymm3
23.11% │ ││ │││││ 0x00007f56280fad86: vpmaxsq %ymm8, %ymm3, %ymm3
2.15% │ ││ │││││ 0x00007f56280fad8c: vpmaxsq %ymm5, %ymm3, %ymm3
8.79% │ ││ │││││ 0x00007f56280fad92: vpmaxsq %ymm4, %ymm3, %ymm3 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
│ ││ │││││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│ ││ │││││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
### `longLoop[Min|Max]` performance improves considerably when probability is 100
Without the patch the code uses compare + move instructions:
4.53% ││││ ││ │ │ 0x00007f96b40faf33: movq 0x18(%rax, %rsi, 8), %r13;*laload {reexecute=0 rethrow=0 return_oop=0}
││││ ││ │ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 20 (line 236)
││││ ││ │ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
2.69% ││││ ││ │ │ 0x00007f96b40faf38: cmpq %r11, %r13
││││╭ ││ │ │ 0x00007f96b40faf3b: jl 0x7f96b40faf67 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││ ││ │ │ ; - java.lang.Math::max at 11 (line 2037)
│││││ ││ │ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 236)
│││││ ││ │ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
8.75% │││││ ││↗ │ │ 0x00007f96b40faf3d: movq %r13, 0x18(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
│││││ │││ │ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236)
│││││ │││ │ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
And with the patch those become vectorized:
3.55% │ ││ 0x00007f13c80fa18a: vmovdqu 0xf0(%rbx, %r10, 8), %ymm5
│ ││ 0x00007f13c80fa194: vmovdqu 0xf0(%rdi, %r10, 8), %ymm6
2.35% │ ││ 0x00007f13c80fa19e: vpmaxsq %ymm6, %ymm5, %ymm5
5.03% │ ││ 0x00007f13c80fa1a4: vmovdqu %ymm5, 0xf0(%rax, %r10, 8)
│ ││ ;*lastore {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236)
│ ││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
It's interesting to observe that at probabilities of 50/80% the baseline performs better than at 100%. The reason is that at 50/80% the baseline already vectorizes. So, why isn't the baseline vectorizing at 100% probability?
VLoop::check_preconditions
Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined
1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
VLoop::check_preconditions: fails because of control flow.
cl_exit 594 594 CountedLoopEnd === 415 593 [[ 1275 463 ]] [lt] P=0.999684, C=707717.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
cl_exit->in(0) 415 415 Region === 415 411 412 [[ 415 594 416 451 ]] !orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
lpt->_head 1256 1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined
VLoop::check_preconditions: failed: control flow in loop not allowed
At 100% probability the baseline fails to vectorize because it observes control flow in the loop. This control flow is not the one you see in the min/max Java implementation; it is added by HotSpot as a result of JIT profiling: it observes that one branch is always taken, optimizes for that case, and adds a branch for the uncommon case where the branch is not taken.
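To illustrate the shape (a hand-written approximation, not actual C2 output): with a balanced profile the ternary in `Math.max` stays a simple diamond that the compiler can collapse into a conditional move (or, with this patch, a MaxL/MinL node), leaving no control flow in the loop. With a ~100% profile the never-taken side is compiled as an uncommon path, modeled below as an exception purely to show the extra branch that remains inside the loop.

static long maxBalancedProfile(long acc, long v) {
    return (acc >= v) ? acc : v;   // diamond collapses: no control flow left in the loop body
}

static long maxSkewedProfile(long acc, long v) {
    if (acc < v) {
        // uncommon path: in the generated code this is a deoptimization/trap edge
        throw new AssertionError("uncommon trap placeholder");
    }
    return acc;                    // hot path the profile says is ~always taken
}
// The surviving in-loop branch is what VLoop::check_preconditions rejects.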
### `longClippingRange` performance improves considerably
Without the patch the code uses compare + move instructions:
3.39% ││ │ ││ │ 0x00007febb40fb175: cmpq %rbp, %rcx
││ │╭ ││ │ 0x00007febb40fb178: jge 0x7febb40fb17d ;*iflt {reexecute=0 rethrow=0 return_oop=0}
││ ││ ││ │ ; - java.lang.Math::max at 3 (line 2037)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
2.69% ││ ││ ││ │ 0x00007febb40fb17a: movq %rbp, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││ ││ ││ │ ; - java.lang.Math::max at 11 (line 2037)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220)
││ ││ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
4.35% ││ │↘ ││ │ 0x00007febb40fb17d: nop
2.93% ││ │ ││ │ 0x00007febb40fb180: cmpq %r8, %rcx
││ │ ╭ ││ │ 0x00007febb40fb183: jle 0x7febb40fb188 ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
││ │ │ ││ │ ; - java.lang.Math::min at 3 (line 2132)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
3.51% ││ │ │ ││ │ 0x00007febb40fb185: movq %r8, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││ │ │ ││ │ ; - java.lang.Math::min at 11 (line 2132)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220)
││ │ │ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
4.26% ││ │ ↘ ││ │ 0x00007febb40fb188: movq %rcx, 0x10(%rsi, %r9, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
││ │ ││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220)
││ │ ││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
With the patch these become vectorized:
0.20% ││↗ ↗ 0x00007f10180fd15c: vmovdqu 0x10(%r11, %rcx, 8), %ymm6
│││ │ 0x00007f10180fd163: vpmaxsq %ymm6, %ymm7, %ymm6
│││ │ 0x00007f10180fd169: vpminsq %ymm8, %ymm6, %ymm6
│││ │ 0x00007f10180fd16f: vmovdqu %ymm6, 0x10(%r8, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
│││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220)
│││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
# `MinMaxVector` AVX2
Following are results on the same machine as above but forcing AVX2 to be used instead of AVX-512:
Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units
MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 832.132 1813.609 ops/ms
MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 832.546 1814.477 ops/ms
MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 938.372 939.313 ops/ms
MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 934.964 945.124 ops/ms
MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 512.076 937.287 ops/ms
MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 999.455 689.750 ops/ms
MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1000.352 876.326 ops/ms
MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.359 999.475 ops/ms
MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 409.413 409.363 ops/ms
MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 409.374 409.141 ops/ms
MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 883.614 409.318 ops/ms
MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 404.723 404.705 ops/ms
MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 404.755 404.748 ops/ms
MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 848.784 404.669 ops/ms
### `longClippingRange` performance improves considerably
Baseline uses compare + move instructions as shown above. But the patched version improves in spite of not being able to use AVX-512 instructions such as `vpmaxsq`. The performance improvement comes from a vectorized compare followed by a vectorized blend (a scalar sketch of this pattern follows the listing):
│ │ ││││ 0x00007f9aa40f94ac: vpcmpgtq %ymm6, %ymm7, %ymm12
3.79% │ │ ││││ 0x00007f9aa40f94b1: vblendvpd %ymm12, %ymm7, %ymm6, %ymm12
3.72% │ │ ││││ 0x00007f9aa40f94b7: vpcmpgtq %ymm8, %ymm12, %ymm10
│ │ ││││ 0x00007f9aa40f94bc: vblendvpd %ymm10, %ymm8, %ymm12, %ymm10
3.78% │ │ ││││ 0x00007f9aa40f94c2: vmovdqu %ymm10, 0xf0(%r8, %rcx, 8)
│ │ ││││ ;*lastore {reexecute=0 rethrow=0 return_oop=0}
│ │ ││││ ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220)
│ │ ││││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)
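The scalar equivalent of this per-lane pattern is a branch-free select: `vpcmpgtq` produces an all-ones/all-zeros mask per lane and `vblendvpd` picks one of the two inputs based on that mask. A minimal scalar sketch of the same idea (illustrative only):

static long maxViaCompareAndBlend(long a, long b) {
    long mask = (a > b) ? -1L : 0L;   // vpcmpgtq: all-ones where a > b, all-zeros otherwise
    return (a & mask) | (b & ~mask);  // vblendvpd: select a where the mask is set, else b
}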
### `longReduction[Min|Max]` performance drops considerably when probability is 100
Baseline uses compare + branch instructions to implement this:
││││ ││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
6.30% ││││ ││││ │ 0x00007fd5580f678b: cmpq %rdi, %rdx
││││╭││││ │ 0x00007fd5580f678e: jge 0x7fd5580f67ac ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││││││││ │ ; - java.lang.Math::max at 11 (line 2037)
│││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
12.88% │││││││││↗ │ 0x00007fd5580f6790: imulq $0xb, 0x28(%r14, %r8, 8), %rbp
││││││││││ │ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││││││││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││││││││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
7.55% ││││││││││ │ 0x00007fd5580f6796: cmpq %rbp, %rdi
│││││╰││││ │ 0x00007fd5580f6799: jl 0x7fd5580f6760 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│││││ ││││ │ ; - java.lang.Math::max at 3 (line 2037)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
│││││ ││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
With the patch the code uses conditional moves instead:
0.05% ↗│ 0x00007fc4700f5253: imulq $0xb, 0x28(%r14, %r11, 8), %rdx
10.62% ││ 0x00007fc4700f5259: imulq $0xb, 0x20(%r14, %r11, 8), %rax
0.63% ││ 0x00007fc4700f525f: imulq $0xb, 0x10(%r14, %r11, 8), %r8
││ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
10.34% ││ 0x00007fc4700f5265: cmpq %r8, %r13
2.37% ││ 0x00007fc4700f5268: cmovlq %r8, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
1.15% ││ 0x00007fc4700f526c: imulq $0xb, 0x18(%r14, %r11, 8), %r8
││ ;*lmul {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
9.28% ││ 0x00007fc4700f5272: cmpq %r8, %r13
3.82% ││ 0x00007fc4700f5275: cmovlq %r8, %r13
21.61% ││ 0x00007fc4700f5279: cmpq %rax, %r13
11.55% ││ 0x00007fc4700f527c: cmovlq %rax, %r13
4.48% ││ 0x00007fc4700f5280: cmpq %rdx, %r13
11.76% ││ 0x00007fc4700f5283: cmovlq %rdx, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256)
││ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)
When one of the branches is always or almost always taken, the branching code in the baseline benefits from branch prediction. The conditional move instructions, however, force the CPU to compute both inputs and wait for the comparison, so the patched code performs worse in this scenario.
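Sketched in scalar form (illustrative, not the generated code), the reduction carries `acc` through every conditional move, so each iteration has to wait for the previous one:

static long reductionMaxCmovShape(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        long v = a[i] * 11;          // the imulq $0xb in the listing above
        acc = (acc < v) ? v : acc;   // cmpq + cmovlq: acc of iteration i+1 depends on iteration i
    }
    return acc;
}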
Why are vectorized instructions not used in this scenario? Vector instructions for long min/max are not available with AVX2, and the vectorization trace signals it:
PackSet::print: 3 packs
Pack: 0
0: 1119 LoadL === 1105 343 1120 [[ 1117 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=997,663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1112 LoadL === 1105 343 1113 [[ 1111 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 997 LoadL === 1105 343 998 [[ 996 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 663 LoadL === 1105 343 455 [[ 458 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
Pack: 1
0: 1117 MulL === _ 1119 162 [[ 1116 ]] !orig=996,458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1111 MulL === _ 1112 162 [[ 1110 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 996 MulL === _ 997 162 [[ 995 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 458 MulL === _ 663 162 [[ 459 ]] !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
Pack: 2
0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
WARNING: Removed pack: not implemented at any smaller size:
0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
After SuperWord::split_packs_only_implemented_with_smaller_size
One interesting option to explore here would be whether MaxL/MinL could be implemented in terms of vectorized compare instructions, as shown above in the `longClippingRange` scenario. Thoughts @rwestrel @eme64?
# `VectorReduction2.WithSuperword` on AVX-512 machine
As requested by Emanuel, I've also run this benchmark. Note that the results here are average time per operation, so the lower the number the better:
Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.WithSuperword.longMaxBig 2048 0 avgt 3 3970.527 1918.821 ns/op
VectorReduction2.WithSuperword.longMaxDotProduct 2048 0 avgt 3 1369.634 1055.762 ns/op
VectorReduction2.WithSuperword.longMaxSimple 2048 0 avgt 3 722.314 2172.064 ns/op
VectorReduction2.WithSuperword.longMinBig 2048 0 avgt 3 3996.694 1918.398 ns/op
VectorReduction2.WithSuperword.longMinDotProduct 2048 0 avgt 3 1363.687 1056.375 ns/op
VectorReduction2.WithSuperword.longMinSimple 2048 0 avgt 3 718.150 2179.478 ns/op
The `long[Min|Max]Big` and `long[Min|Max]DotProduct` benchmarks show considerable improvements, but something odd is happening in `long[Min|Max]Simple`.
### `long[Min|Max]Simple` performance drops considerably
Baseline uses compare + branch instructions:
8.05% ││ ││↗ │ 0x00007f9d580f569b: movq 0x18(%r13, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
││ │││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
││ │││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
0.23% ││ │││ │ 0x00007f9d580f56a0: cmpq %r8, %rsi
││╭ │││ │ 0x00007f9d580f56a3: jl 0x7f9d580f5713 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
│││ │││ │ ; - java.lang.Math::max at 11 (line 2037)
│││ │││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
│││ │││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
Patched version uses conditional moves instead of vectorized instructions:
2.76% ││ 0x00007fcd180f695c: movq 0x18(%r14, %r11, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
││ 0x00007fcd180f6961: cmpq %rdi, %r13
3.11% ││ 0x00007fcd180f6964: cmovlq %rdi, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
Why are vectorized instructions not kicking in with the patch? Because SuperWord does not consider it profitable to vectorize this:
PackSet::print: 2 packs
Pack: 0
0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
Pack: 1
0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
WARNING: Removed pack: not profitable:
0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
WARNING: Removed pack: not profitable:
0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
After Superword::filter_packs_for_profitable
PackSet::print: 0 packs
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize
How can you make it vectorize? By doing some additional work on the array value before passing it to min/max. That is what the `MinMaxVector.longReduction[Min|Max]` and `VectorReduction2.long[Min|Max]DotProduct` methods do, as sketched below.
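Roughly, sketched from the line references in the disassembly (not the exact benchmark source):

// longMaxSimple shape: max over the raw array values; SuperWord drops the packs
// as not profitable, so the loop stays scalar and falls back to conditional moves.
static long maxSimple(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i]);
    }
    return acc;
}

// longMaxDotProduct shape: the extra multiply feeding the max makes vectorization
// profitable, so the loop vectorizes with the patch.
static long maxDotProduct(long[] a, long[] b) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i] * b[i]);
    }
    return acc;
}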
# `VectorReduction2.NoSuperword` on AVX-512 machine
Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.NoSuperword.longMaxBig 2048 0 avgt 3 3964.403 2966.258 ns/op
VectorReduction2.NoSuperword.longMaxDotProduct 2048 0 avgt 3 1686.373 2462.876 ns/op
VectorReduction2.NoSuperword.longMaxSimple 2048 0 avgt 3 722.219 2171.859 ns/op
VectorReduction2.NoSuperword.longMinBig 2048 0 avgt 3 3994.685 2971.143 ns/op
VectorReduction2.NoSuperword.longMinDotProduct 2048 0 avgt 3 1366.291 2428.173 ns/op
VectorReduction2.NoSuperword.longMinSimple 2048 0 avgt 3 719.218 2179.546 ns/op
Performance improves for `long[Min|Max]Big`. `long[Min|Max]Simple` suffers from the same issue as shown in the previous section: when not vectorized, these benchmarks fall back on conditional moves. The drop in performance in `long[Min|Max]DotProduct` needs some explanation.
### `long[Min|Max]DotProduct` performance drops considerably
Baseline uses compare + branch instructions here:
5.67% │││ │││↗ │ 0x00007f3fcc0fa71d: movq 0x20(%r14, %r8, 8), %r9
5.19% │││ ││││ │ 0x00007f3fcc0fa722: imulq 0x20(%rax, %r8, 8), %r9;*lmul {reexecute=0 rethrow=0 return_oop=0}
│││ ││││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125)
│││ ││││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
8.46% │││ ││││ │ 0x00007f3fcc0fa728: cmpq %r9, %rsi
│││╭││││ │ 0x00007f3fcc0fa72b: jl 0x7f3fcc0fa751 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
││││││││ │ ; - java.lang.Math::max at 11 (line 2037)
││││││││ │ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126)
││││││││ │ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
Patch transforms this into conditional moves:
11.00% │ 0x00007f66f40f70b2: movq 0x18(%r13, %rcx, 8), %rax
│ 0x00007f66f40f70b7: imulq 0x18(%r9, %rcx, 8), %rax;*lmul {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
│ 0x00007f66f40f70bd: cmpq %rdx, %rax
13.07% │ 0x00007f66f40f70c0: cmovlq %rdx, %rax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190)
This is similar to what we have seen above. Without SuperWord, the fallback for MaxL/MinL is conditional moves. Although branch probabilities are not controlled here, we can observe that one of the branches is likely taken ~100% of the time.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2642788364