RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]

Wed Feb 26 11:36:11 UTC 2025

On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the java implementation for these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
> 
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>    
>    * Remove platform dependant check
>    and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/abdd4f5e...a190ae68

> > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this.
> 
> Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example?

The probabilities are fine.

I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`.

In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in.

In the slow versions you see this `PrintMethodData`:

static java.lang.Math::min(II)I
  interpreter_invocation_count:       18171
  invocation_counter:                 18171
  backedge_counter:                       0
  decompile_count:                        0
  mdo size: 328 bytes

   0 iload_0
   1 iload_1
   2 if_icmpgt 9
  0    bci: 2    BranchData         taken(7732) displacement(56)
                                    not taken(10180)
   5 iload_0
   6 goto 10
  32   bci: 6    JumpData           taken(10180) displacement(24)
   9 iload_1
  10 ireturn

org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
  interpreter_invocation_count:         189
  invocation_counter:                   189
  backedge_counter:                  313344
  decompile_count:                        0
  mdo size: 384 bytes

   0 iconst_0
   1 istore_2
   2 iconst_0
   3 istore_3
   4 iload_3
   5 aload_1
   6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I>
   9 if_icmpge 33
  0    bci: 9    BranchData         taken(58) displacement(72)
                                    not taken(192512)
  12 aload_1
  13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I>
  16 iload_3
  17 iaload
  18 istore #4
  20 iload_2
  21 fast_iload #4
  23 invokestatic 32 <java/lang/Math.min(II)I>
  32   bci: 23   CounterData        count(192512)
  26 istore_2
  27 iinc #3 1
  30 goto 4
  48   bci: 30   JumpData           taken(192512) displacement(-48)
  33 iload_2
  34 ireturn

The benchmark method calls Math.min `192_512` times, yet the method data shows only `18_171` invocations,
of which `7_732` are taken which is 42%.
So it gets compiled with a `cmov` and the benchmark will be slow because it will branch 100% one of the sides.

In the fast version, `PrintMethodData` looks like this:

static java.lang.Math::min(II)I
  interpreter_invocation_count:     1575322
  invocation_counter:               1575322
  backedge_counter:                       0
  decompile_count:                        0
  mdo size: 368 bytes

   0 iload_0
   1 iload_1
   2 if_icmpgt 9
  0    bci: 2    BranchData         taken(1418001) displacement(56)
                                    not taken(157062)
   5 iload_0
   6 goto 10
  32   bci: 6    JumpData           taken(157062) displacement(24)
   9 iload_1
  10 ireturn

org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
  interpreter_invocation_count:         858
  invocation_counter:                   858
  backedge_counter:                 1756214
  decompile_count:                        0
  mdo size: 424 bytes

   0 iconst_0
   1 istore_2
   2 iconst_0
   3 istore_3
   4 iload_3
   5 aload_1
   6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I>
   9 if_icmpge 33
  0    bci: 9    BranchData         taken(733) displacement(72)
                                    not taken(1637363)
  12 aload_1
  13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I>
  16 iload_3
  17 iaload
  18 istore #4
  20 iload_2
  21 fast_iload #4
  23 invokestatic 32 <java/lang/Math.min(II)I>
  32   bci: 23   CounterData        count(1637363)
  26 istore_2
  27 iinc #3 1
  30 goto 4
  48   bci: 30   JumpData           taken(1637363) displacement(-48)
  33 iload_2
  34 ireturn

The benchmark method calls Math.min `1_637_363` times, and the method data shows `1_575_322` invocations,
of which `1_418_001` are taken which is 90%. So no cmov is introduced and the benchmark will be fast because it will branch 100% one of the sides.

A factor here might be my Xeon machine. I run the benchmark on a 4 core VM inside it, so given the limited resources compilation can take longer. I've noticed that it's easier to replicate this scenario there rather than my M1 laptop, which has 10 cores.

>> So, if those int scalar regressions were not a problem when int min/max intrinsic was added, I would expect the same to apply to long.
>
> Do you know when they were added? If that was a long time ago, we might not have noticed back then, but we might notice now.

I don't know when they were added.

> That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability.
> 
> What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here.

I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2684701935