RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Galder Zamarreño
galder at openjdk.org
Thu Feb 20 06:53:04 UTC 2025
On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the java implementation for these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
> * Remove platform dependant check
> and use platform independent configuration instead.
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ... and 34 more: https://git.openjdk.org/jdk/compare/af7645e5...a190ae68
To follow up https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851, I've run `MinMaxVector.int` benchmarks with **superword disabled** and with/without `_max`/`_min` intrinsics in both AVX-512 and AVX2 modes.
# AVX-512
Benchmark (probability) (range) (seed) (size) Mode Cnt -min/-max +min/+max Units
MinMaxVector.intClippingRange N/A 90 0 1000 thrpt 4 1067.050 1038.640 ops/ms
MinMaxVector.intClippingRange N/A 100 0 1000 thrpt 4 1041.922 1039.004 ops/ms
MinMaxVector.intLoopMax 50 N/A N/A 2048 thrpt 4 605.173 604.337 ops/ms
MinMaxVector.intLoopMax 80 N/A N/A 2048 thrpt 4 605.106 604.309 ops/ms
MinMaxVector.intLoopMax 100 N/A N/A 2048 thrpt 4 604.547 604.432 ops/ms
MinMaxVector.intLoopMin 50 N/A N/A 2048 thrpt 4 495.042 605.216 ops/ms (+22%)
MinMaxVector.intLoopMin 80 N/A N/A 2048 thrpt 4 495.105 495.217 ops/ms
MinMaxVector.intLoopMin 100 N/A N/A 2048 thrpt 4 495.040 495.176 ops/ms
MinMaxVector.intReductionMultiplyMax 50 N/A N/A 2048 thrpt 4 407.920 407.984 ops/ms
MinMaxVector.intReductionMultiplyMax 80 N/A N/A 2048 thrpt 4 407.710 407.965 ops/ms
MinMaxVector.intReductionMultiplyMax 100 N/A N/A 2048 thrpt 4 874.881 407.922 ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin 50 N/A N/A 2048 thrpt 4 407.911 407.947 ops/ms
MinMaxVector.intReductionMultiplyMin 80 N/A N/A 2048 thrpt 4 408.015 408.024 ops/ms
MinMaxVector.intReductionMultiplyMin 100 N/A N/A 2048 thrpt 4 407.978 407.994 ops/ms
MinMaxVector.intReductionSimpleMax 50 N/A N/A 2048 thrpt 4 460.538 460.439 ops/ms
MinMaxVector.intReductionSimpleMax 80 N/A N/A 2048 thrpt 4 460.579 460.542 ops/ms
MinMaxVector.intReductionSimpleMax 100 N/A N/A 2048 thrpt 4 998.211 460.404 ops/ms (-53%)
MinMaxVector.intReductionSimpleMin 50 N/A N/A 2048 thrpt 4 460.570 460.447 ops/ms
MinMaxVector.intReductionSimpleMin 80 N/A N/A 2048 thrpt 4 460.552 460.493 ops/ms
MinMaxVector.intReductionSimpleMin 100 N/A N/A 2048 thrpt 4 460.455 460.485 ops/ms
There is some improvement in `intLoopMin` @ 50%, but this didn't materialize in the `perfasm` run, so I don't think it can be strictly correlated with the use of the intrinsic.
The `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% regressions with the `max` intrinsic activated are consistent with what we saw with long.
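For reference, a hedged sketch of the loop shape where that @ 100% regression shows up (class and method names here are hypothetical stand-ins, not the actual JMH benchmark source). The idea is that with ascending data every iteration updates the accumulator, i.e. the comparison inside `Math.max` resolves the same way 100% of the time, so a well-predicted cmp+branch can beat an unconditional `cmov`:

```java
// Hypothetical sketch of an intReductionSimpleMax-style loop. With
// ascending input, Math.max's comparison always takes the same side
// (100% branch probability), the scenario where the regression appears.
public class MaxReductionSketch {
    static int intReductionSimpleMax(int[] a) {
        int result = Integer.MIN_VALUE;
        for (int v : a) {
            result = Math.max(result, v); // accumulator is replaced on every iteration
        }
        return result;
    }

    public static void main(String[] args) {
        // Ascending data: the "update the accumulator" branch wins every time.
        int[] ascending = new int[2048];
        for (int i = 0; i < ascending.length; i++) ascending[i] = i;
        System.out.println(intReductionSimpleMax(ascending)); // prints 2047
    }
}
```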
### `intReductionMultiplyMin` and `intReductionSimpleMin` @ 100% same performance
There is something very intriguing happening here, and I don't know whether it's due to the `min` operation itself or to int vs long. Basically, with or without the `min` intrinsic, the performance of these 2 benchmarks is the same at 100% branch probability. What is going on? Let's look at one of them:
-min
# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_max,_min -XX:-UseSuperWord
...
3.04% ││││ │ 0x00007f49280f76e9: cmpl %edi, %r10d
3.14% ││││ │ 0x00007f49280f76ec: cmovgl %edi, %r10d ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
││││ │ ; - java.lang.Math::min at 10 (line 2119)
││││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
││││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)
+min
# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
...
3.10% ││ │ 0x00007fbf340f6b97: cmpl %edi, %r10d
3.08% ││ │ 0x00007fbf340f6b9a: cmovgl %edi, %r10d ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
││ │ ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)
Both are `cmov`: without the intrinsic the `Math::min` bytecode gets compiled into a `cmov`, and with the intrinsic the result is the same.
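To make the shape concrete, here is a hedged sketch of a min reduction of this kind (names are hypothetical, not the actual `MinMaxVector` source): a plain `Math.min` accumulator loop, which per the perfasm output above C2 compiles to cmp+`cmovgl` in both the intrinsified and non-intrinsified variants.

```java
// Hypothetical sketch of an intReductionSimpleMin-style loop: a simple
// Math.min reduction that C2 compiles to cmp + cmov on x86, with or
// without the _min intrinsic, per the disassembly above.
public class MinReductionSketch {
    static int intReductionSimpleMin(int[] a) {
        int result = Integer.MAX_VALUE;
        for (int v : a) {
            result = Math.min(result, v); // compiled to cmpl + cmovgl
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 8, 1, 9};
        System.out.println(intReductionSimpleMin(data)); // prints 1
    }
}
```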
I will verify this with long shortly to see whether this behaviour is specific to the `min` operation or related to int vs long.
# AVX2
Here are the AVX2 numbers:
Benchmark (probability) (range) (seed) (size) Mode Cnt -min/-max +min/+max Units
MinMaxVector.intClippingRange N/A 90 0 1000 thrpt 4 1068.265 1039.087 ops/ms
MinMaxVector.intClippingRange N/A 100 0 1000 thrpt 4 1067.705 1038.760 ops/ms
MinMaxVector.intLoopMax 50 N/A N/A 2048 thrpt 4 605.015 604.364 ops/ms
MinMaxVector.intLoopMax 80 N/A N/A 2048 thrpt 4 605.169 604.366 ops/ms
MinMaxVector.intLoopMax 100 N/A N/A 2048 thrpt 4 604.527 604.494 ops/ms
MinMaxVector.intLoopMin 50 N/A N/A 2048 thrpt 4 605.099 605.057 ops/ms
MinMaxVector.intLoopMin 80 N/A N/A 2048 thrpt 4 495.071 605.080 ops/ms (+22%)
MinMaxVector.intLoopMin 100 N/A N/A 2048 thrpt 4 495.134 495.047 ops/ms
MinMaxVector.intReductionMultiplyMax 50 N/A N/A 2048 thrpt 4 407.953 407.987 ops/ms
MinMaxVector.intReductionMultiplyMax 80 N/A N/A 2048 thrpt 4 407.861 408.005 ops/ms
MinMaxVector.intReductionMultiplyMax 100 N/A N/A 2048 thrpt 4 873.915 407.995 ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin 50 N/A N/A 2048 thrpt 4 408.019 407.987 ops/ms
MinMaxVector.intReductionMultiplyMin 80 N/A N/A 2048 thrpt 4 407.971 408.009 ops/ms
MinMaxVector.intReductionMultiplyMin 100 N/A N/A 2048 thrpt 4 407.970 407.956 ops/ms
MinMaxVector.intReductionSimpleMax 50 N/A N/A 2048 thrpt 4 460.443 460.514 ops/ms
MinMaxVector.intReductionSimpleMax 80 N/A N/A 2048 thrpt 4 460.484 460.581 ops/ms
MinMaxVector.intReductionSimpleMax 100 N/A N/A 2048 thrpt 4 1015.601 460.446 ops/ms (-54%)
MinMaxVector.intReductionSimpleMin 50 N/A N/A 2048 thrpt 4 460.494 460.532 ops/ms
MinMaxVector.intReductionSimpleMin 80 N/A N/A 2048 thrpt 4 460.489 460.451 ops/ms
MinMaxVector.intReductionSimpleMin 100 N/A N/A 2048 thrpt 4 1021.420 460.435 ops/ms (-55%)
This time we see an improvement in `intLoopMin` @ 80%, but again it was not observable in the `perfasm` run.
`intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% have regressions, the familiar cmp+mov vs cmov difference. `intReductionMultiplyMin` @ 100% does not have a regression for the same reason as above: both variants use `cmov`.
The interesting thing is `intReductionSimpleMin` @ 100%. We see a regression there, but I didn't observe it in the `perfasm` run. So this could be due to run-to-run variance in whether `cmov` is applied?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670609470
More information about the core-libs-dev
mailing list