RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]

Wed Feb 19 17:42:08 UTC 2025

On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the Java implementation of these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls, replacing the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
> 
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>    
>    * Remove platform dependant check
>    and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/75abfbc2...a190ae68

Following our discussion, I've run `MinMaxVector.long` benchmarks with superword disabled and with/without `_maxL` intrinsic in both AVX-512 and AVX2 modes.

The first thing I've observed is that, with superword disabled, the AVX-512 and AVX2 results are identical, so I will just focus on the AVX-512 results below.

Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt     -maxL      +maxL   Units
MinMaxVector.longClippingRange                   N/A       90       0    1000  thrpt    4  1012.017  1011.8109  ops/ms
MinMaxVector.longClippingRange                   N/A      100       0    1000  thrpt    4  1012.113  1011.9530  ops/ms
MinMaxVector.longLoopMax                          50      N/A     N/A    2048  thrpt    4   463.946   473.9408  ops/ms
MinMaxVector.longLoopMax                          80      N/A     N/A    2048  thrpt    4   465.391   473.8063  ops/ms
MinMaxVector.longLoopMax                         100      N/A     N/A    2048  thrpt    4   510.992   471.6280  ops/ms (-8%)
MinMaxVector.longLoopMin                          50      N/A     N/A    2048  thrpt    4   496.036   495.3142  ops/ms
MinMaxVector.longLoopMin                          80      N/A     N/A    2048  thrpt    4   495.797   497.1214  ops/ms
MinMaxVector.longLoopMin                         100      N/A     N/A    2048  thrpt    4   495.302   495.1535  ops/ms
MinMaxVector.longReductionMultiplyMax             50      N/A     N/A    2048  thrpt    4   405.495   405.3936  ops/ms
MinMaxVector.longReductionMultiplyMax             80      N/A     N/A    2048  thrpt    4   405.342   405.4505  ops/ms
MinMaxVector.longReductionMultiplyMax            100      N/A     N/A    2048  thrpt    4   846.492   405.4779  ops/ms (-52%)
MinMaxVector.longReductionMultiplyMin             50      N/A     N/A    2048  thrpt    4   414.755   414.7036  ops/ms
MinMaxVector.longReductionMultiplyMin             80      N/A     N/A    2048  thrpt    4   414.705   414.7093  ops/ms
MinMaxVector.longReductionMultiplyMin            100      N/A     N/A    2048  thrpt    4   414.761   414.7150  ops/ms
MinMaxVector.longReductionSimpleMax               50      N/A     N/A    2048  thrpt    4   460.435   460.3764  ops/ms
MinMaxVector.longReductionSimpleMax               80      N/A     N/A    2048  thrpt    4   460.438   460.4718  ops/ms
MinMaxVector.longReductionSimpleMax              100      N/A     N/A    2048  thrpt    4  1023.005   460.5417  ops/ms (-55%)
MinMaxVector.longReductionSimpleMin               50      N/A     N/A    2048  thrpt    4   459.184   459.1662  ops/ms
MinMaxVector.longReductionSimpleMin               80      N/A     N/A    2048  thrpt    4   459.265   459.2588  ops/ms
MinMaxVector.longReductionSimpleMin              100      N/A     N/A    2048  thrpt    4   459.263   459.1304  ops/ms
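
For reference, the percentages annotated on the 100% rows are the relative change of the `+maxL` score against the `-maxL` baseline. A minimal sketch of that arithmetic follows; the class and method names are made up for illustration, and only the scores come from the table above.

```java
public class RelativeChange {
    // Relative change of the candidate (+maxL) score against the baseline (-maxL), in percent.
    static double relativeChangePercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms, copied from the 100% probability rows of the table.
        System.out.printf("longLoopMax              %.1f%%%n", relativeChangePercent(510.992, 471.6280));
        System.out.printf("longReductionMultiplyMax %.1f%%%n", relativeChangePercent(846.492, 405.4779));
        System.out.printf("longReductionSimpleMax   %.1f%%%n", relativeChangePercent(1023.005, 460.5417));
    }
}
```

Rounded to whole percent, these match the (-8%), (-52%) and (-55%) annotations.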

`longLoopMax` at 100%, `longReductionMultiplyMax` at 100% and `longReductionSimpleMax` at 100% regress with the `_maxL` intrinsic. The cause is familiar: without the intrinsic, cmp+mov are emitted, guarded by a conditional jump (`jl`/`jge` in the `-maxL` listings below), while with the intrinsic and the conditions above, `cmov` is emitted:

# `longLoopMax` @ 100%

-maxL:

   4.18%  ││││  │││   │           0x00007fb7580f84b2:   cmpq		%r13, %r11
          ││││╭ │││   │           0x00007fb7580f84b5:   jl		0x7fb7580f84ec      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││││ │││   │                                                                     ; - java.lang.Math::max at 11 (line 2038)
          │││││ │││   │                                                                     ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 256)
          │││││ │││   │                                                                     ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
   4.23%  │││││ │││↗  │           0x00007fb7580f84bb:   movq		%r11, 0x10(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          │││││ ││││  │                                                                     ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 256)
          │││││ ││││  │                                                                     ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)

+maxL:

   1.06%  │││  0x00007fe1b40f5ed1:   movq		0x20(%rbx, %r10, 8), %r14;*laload {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 26 (line 256)
          │││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
   1.34%  │││  0x00007fe1b40f5ed6:   cmpq		%r14, %r9
   2.78%  │││  0x00007fe1b40f5ed9:   cmovlq		%r14, %r9
   2.58%  │││  0x00007fe1b40f5edd:   movq		%r9, 0x20(%rax, %r10, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 256)
          │││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)

# `longReductionMultiplyMax` @ 100%

-maxL:

   6.71%  ││  ││↗    0x00007f8af40f6278:   imulq		$0xb, 0x18(%r14, %r8, 8), %rdx
          ││  │││                                                              ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││  │││                                                              ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 24 (line 285)
          ││  │││                                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
   5.28%  ││  │││    0x00007f8af40f627e:   nop
  10.23%  ││  │││    0x00007f8af40f6280:   cmpq		%rdx, %rdi
          ││╭ │││    0x00007f8af40f6283:   jge		0x7f8af40f62a7      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││ │││                                                              ; - java.lang.Math::max at 11 (line 2038)
          │││ │││                                                              ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 30 (line 286)
          │││ │││                                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

+maxL:

  11.07%  ││  0x00007f47000f5c4d:   imulq		$0xb, 0x18(%r14, %r11, 8), %rax
          ││                                                            ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 24 (line 285)
          ││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
   0.07%  ││  0x00007f47000f5c53:   cmpq		%rdx, %rax
  11.87%  ││  0x00007f47000f5c56:   cmovlq		%rdx, %rax          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 30 (line 286)
          ││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

# `longReductionSimpleMax` @ 100%

-maxL:

   5.71%  │││││     │││↗      │             0x00007fc2380f75f9:   movq		0x20(%r14, %r8, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
          │││││     ││││      │                                                                       ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 20 (line 295)
          │││││     ││││      │                                                                       ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
   1.85%  │││││     ││││      │             0x00007fc2380f75fe:   nop
   4.52%  │││││     ││││      │             0x00007fc2380f7600:   cmpq		%rdi, %rdx
          │││││╭    ││││      │             0x00007fc2380f7603:   jge		0x7fc2380f7667      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          ││││││    ││││      │                                                                       ; - java.lang.Math::max at 11 (line 2038)
          ││││││    ││││      │                                                                       ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 26 (line 296)
          ││││││    ││││      │                                                                       ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)

+maxL:

   3.06%   ││││││  0x00007fa6d00f6020:   movq		0x70(%r14, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
           ││││││                                                            ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 20 (line 295)
           ││││││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
           ││││││  0x00007fa6d00f6025:   cmpq		%r8, %r13
   2.88%   ││││││  0x00007fa6d00f6028:   cmovlq		%r8, %r13           ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
           ││││││                                                            ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 26 (line 296)
           ││││││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
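
To make the two shapes in the listings easier to relate to source code, here is a rough Java sketch. It is not the `MinMaxVector` benchmark: the class, the method names and the ascending input are assumptions, chosen only so that every comparison against the running maximum resolves the same way, which is my reading of the 100% case.

```java
public class MaxShapes {
    // Same source shape as the Math.max(long, long) body quoted in the PR description.
    static long maxTernary(long a, long b) {
        return (a >= b) ? a : b;
    }

    // Hypothetical reduction kernel using the ternary form directly.
    static long reduceTernary(long[] values) {
        long result = Long.MIN_VALUE;
        for (long v : values) {
            result = maxTernary(result, v);
        }
        return result;
    }

    // Hypothetical reduction kernel calling Math.max, the call the patch intrinsifies.
    static long reduceMathMax(long[] values) {
        long result = Long.MIN_VALUE;
        for (long v : values) {
            result = Math.max(result, v);
        }
        return result;
    }

    // Assumed stand-in for the 100% case: strictly ascending input, so the
    // comparison against the running maximum always goes the same way.
    static long[] ascending(int size) {
        long[] values = new long[size];
        for (int i = 0; i < size; i++) {
            values[i] = i;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] data = ascending(2048);
        // Both kernels compute the same maximum; only the generated code may differ.
        System.out.println(reduceTernary(data) == reduceMathMax(data));
    }
}
```

Which machine code comes out of either kernel depends on the JIT and on whether the intrinsic is enabled; the sketch only pins down the source shapes. With a fully predictable comparison like this one, the compare-and-jump form can be cheap, whereas `cmov` always makes the running maximum depend on the previous iteration's result, which would be consistent with the 100% rows above.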

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669329851