RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]

Tue Feb 18 08:04:21 UTC 2025

On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the java implementation for these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
> 
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>    
>    * Remove platform dependant check
>    and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/6ad0c61a...a190ae68

What is happening with int min/max needs a separate investigation because based on my testing, the int min/max intrinsic is both a regression and a performance improvement! Check this out:

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax" MICRO="FORK=1"
Benchmark                           (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.intReductionSimpleMax             50    2048  thrpt    4  460.585 ± 0.348  ops/ms
MinMaxVector.intReductionSimpleMax             80    2048  thrpt    4  460.633 ± 0.103  ops/ms
MinMaxVector.intReductionSimpleMax            100    2048  thrpt    4  460.580 ± 0.091  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max"
Benchmark                           (probability)  (size)   Mode  Cnt     Score   Error   Units
MinMaxVector.intReductionSimpleMax             50    2048  thrpt    4   460.479 ± 0.044  ops/ms
MinMaxVector.intReductionSimpleMax             80    2048  thrpt    4   460.587 ± 0.106  ops/ms
MinMaxVector.intReductionSimpleMax            100    2048  thrpt    4  1027.831 ± 9.353  ops/ms
80%:
          ││      │   0x00007ffb200fa089:   cmpl		%r11d, %r10d
   3.04%  ││      │   0x00007ffb200fa08c:   cmovll		%r11d, %r10d
   4.38%  ││      │   0x00007ffb200fa090:   cmpl		%ebx, %r10d
   1.61%  ││      │   0x00007ffb200fa093:   cmovll		%ebx, %r10d
   2.79%  ││      │   0x00007ffb200fa097:   cmpl		%edi, %r10d
   2.92%  ││      │   0x00007ffb200fa09a:   cmovll		%edi, %r10d         ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
          ││      │                                                             ; - java.lang.Math::max at 10 (line 2023)
          ││      │                                                             ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax at 23 (line 232)
100%:
   3.11%  │││││││   ││││││    │             0x00007f26c00f8f9c:   nopl		(%rax)
   3.31%  │││││││   ││││││    │             0x00007f26c00f8fa0:   cmpl		%r10d, %ecx
          │││││││╭  ││││││    │             0x00007f26c00f8fa3:   jge		0x7f26c00f8ff1      ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
          ││││││││  ││││││    │                                                                       ; - java.lang.Math::max at 10 (line 2023)
          ││││││││  ││││││    │                                                                       ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax at 23 (line 232)
          ││││││││  ││││││    │                                                                       ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub at 19 (line 124)

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax" MICRO="FORK=1"
Benchmark                             (probability)  (size)   Mode  Cnt     Score   Error   Units
MinMaxVector.intReductionMultiplyMax             50    2048  thrpt    4  2815.614 ± 0.406  ops/ms
MinMaxVector.intReductionMultiplyMax             80    2048  thrpt    4  2814.943 ± 2.174  ops/ms
MinMaxVector.intReductionMultiplyMax            100    2048  thrpt    4  2815.285 ± 1.725  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max"
Benchmark                             (probability)  (size)   Mode  Cnt     Score   Error   Units
MinMaxVector.intReductionMultiplyMax             50    2048  thrpt    4  2802.062 ± 0.710  ops/ms
MinMaxVector.intReductionMultiplyMax             80    2048  thrpt    4  2814.874 ± 4.058  ops/ms
MinMaxVector.intReductionMultiplyMax            100    2048  thrpt    4   883.879 ± 0.327  ops/ms
80%:
   3.54%  │    ││ │││││  0x00007faa700fa177:   vpmaxsd		%ymm4, %ymm5, %ymm13;*ireturn {reexecute=0 rethrow=0 return_oop=0}
          │    ││ │││││                                                            ; - java.lang.Math::max at 10 (line 2023)

100:
   7.50%  ││││││││││││││││││↗ │             0x00007f75280f8849:   imull		$0xb, 0x2c(%rbp, %r11, 4), %r10d
          │││││││││││││││││││ │                                                                       ;*imul {reexecute=0 rethrow=0 return_oop=0}
          │││││││││││││││││││ │                                                                       ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax at 20 (line 221)
          │││││││││││││││││││ │                                                                       ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
   3.85%  │││││││││││││││││││ │             0x00007f75280f884f:   cmpl		%r10d, %r8d
          ││││││││││╰││││││││ │             0x00007f75280f8852:   jl		0x7f75280f87d0      ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
          ││││││││││ ││││││││ │                                                                       ; - java.lang.Math::max at 2 (line 2023)
          ││││││││││ ││││││││ │                                                                       ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax at 26 (line 222)
          ││││││││││ ││││││││ │                                                                       ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

I ran the exact same test with longs and I don't see such an issue. The performance is always the same either with the intrisinc or disabling it as shown above.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664871838