RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]

Mon Feb 17 17:05:28 UTC 2025

On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the Java implementation for these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
> 
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>    
>    * Remove platform dependant check
>    and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/ba549afe...a190ae68

Another interesting comparison arises from the above when looking at `test2` at 80% vs 100%:

test2 (100%):

 ;; B12: #	out( B21 B13 ) <- in( B11 B20 )  Freq: 1.6744e+09
  0x00007f15bcada2e9:   movl		0x14(%rsi, %rdx, 4), %r11d
                                                            ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMax::test2@14 (line 71)
  0x00007f15bcada2ee:   cmpl		%r11d, %r10d
  0x00007f15bcada2f1:   jge		0x7f15bcada362      ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMax::test2@25 (line 71)

test2 (80%):

 ;; B10: #	out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09
  0x00007fe850ada2f0:   movl		0x4c(%rsi, %rdx, 4), %r11d
  0x00007fe850ada2f5:   movl		%r11d, (%rsp)
  0x00007fe850ada2f9:   movl		0x48(%rsi, %rdx, 4), %r10d
  0x00007fe850ada2fe:   movl		%r10d, 4(%rsp)
  0x00007fe850ada303:   movl		0x10(%rsi, %rdx, 4), %r11d
  0x00007fe850ada308:   movl		0x14(%rsi, %rdx, 4), %r9d
  0x00007fe850ada30d:   movl		0x44(%rsi, %rdx, 4), %r10d
  0x00007fe850ada312:   movl		%r10d, 8(%rsp)
  0x00007fe850ada317:   movl		0x18(%rsi, %rdx, 4), %r8d
  0x00007fe850ada31c:   cmpl		%r11d, %eax
  0x00007fe850ada31f:   cmovll		%r11d, %eax
  0x00007fe850ada323:   cmpl		%r9d, %eax
  0x00007fe850ada326:   cmovll		%r9d, %eax
  0x00007fe850ada32a:   movl		0x20(%rsi, %rdx, 4), %r10d
  0x00007fe850ada32f:   cmpl		%r8d, %eax
  0x00007fe850ada332:   cmovll		%r8d, %eax
  0x00007fe850ada336:   movl		0x24(%rsi, %rdx, 4), %r8d
  0x00007fe850ada33b:   movl		0x28(%rsi, %rdx, 4), %r11d
                                                            ;   {no_reloc}
  0x00007fe850ada340:   movl		0x2c(%rsi, %rdx, 4), %ecx
  0x00007fe850ada344:   movl		0x30(%rsi, %rdx, 4), %r9d
  0x00007fe850ada349:   movl		0x34(%rsi, %rdx, 4), %edi
  0x00007fe850ada34d:   movl		0x38(%rsi, %rdx, 4), %ebx
  0x00007fe850ada351:   movl		0x3c(%rsi, %rdx, 4), %ebp
  0x00007fe850ada355:   movl		0x40(%rsi, %rdx, 4), %r13d
  0x00007fe850ada35a:   movl		0x1c(%rsi, %rdx, 4), %r14d
  0x00007fe850ada35f:   cmpl		%r14d, %eax
  0x00007fe850ada362:   cmovll		%r14d, %eax
  0x00007fe850ada366:   cmpl		%r10d, %eax
  0x00007fe850ada369:   cmovll		%r10d, %eax
  0x00007fe850ada36d:   cmpl		%r8d, %eax
  0x00007fe850ada370:   cmovll		%r8d, %eax
  0x00007fe850ada374:   cmpl		%r11d, %eax
  0x00007fe850ada377:   cmovll		%r11d, %eax
  0x00007fe850ada37b:   cmpl		%ecx, %eax
  0x00007fe850ada37d:   cmovll		%ecx, %eax
  0x00007fe850ada380:   cmpl		%r9d, %eax
  0x00007fe850ada383:   cmovll		%r9d, %eax
  0x00007fe850ada387:   cmpl		%edi, %eax
  0x00007fe850ada389:   cmovll		%edi, %eax
  0x00007fe850ada38c:   cmpl		%ebx, %eax
  0x00007fe850ada38e:   cmovll		%ebx, %eax
  0x00007fe850ada391:   cmpl		%ebp, %eax
  0x00007fe850ada393:   cmovll		%ebp, %eax
  0x00007fe850ada396:   cmpl		%r13d, %eax
  0x00007fe850ada399:   cmovll		%r13d, %eax
  0x00007fe850ada39d:   cmpl		8(%rsp), %eax
  0x00007fe850ada3a1:   movl		8(%rsp), %r11d
  0x00007fe850ada3a6:   cmovll		%r11d, %eax
  0x00007fe850ada3aa:   cmpl		4(%rsp), %eax
  0x00007fe850ada3ae:   movl		4(%rsp), %r10d
  0x00007fe850ada3b3:   cmovll		%r10d, %eax
  0x00007fe850ada3b7:   cmpl		(%rsp), %eax
  0x00007fe850ada3ba:   movl		(%rsp), %r11d
  0x00007fe850ada3be:   cmovll		%r11d, %eax         ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMax::test2@25 (line 71)

A couple of things are puzzling me. This test behaves like a reduction test, yet no vectorization appears to kick in at any of the percentages (I've not enabled tracing of SW rejections to check). The other strange thing is the overall time. When no vectorization kicks in and the code uses cmovs, I've been seeing worse performance numbers than with compare and jump, particularly in the 100% tests. With `TestIntMax` it appears to be the opposite: test2 at 100% uses cmp+jmp, which performs worse than the cmov versions.
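
For anyone following along without the benchmark at hand, here is a minimal sketch of the kind of int max reduction discussed above. It is not the actual `TestIntMax` source: the class name, the `data` generator and both reduction methods are hypothetical, and it assumes the percentage means the share of elements that exceed the running maximum. Whether C2 emits cmov or cmp+jmp for either loop depends on the collected branch profile, not on the source shape alone.

```java
import java.util.Random;

// Hypothetical sketch, not the real TestIntMax benchmark.
final class MaxReductionSketch {

    // Assumption: "percent" is the share of elements that become a new maximum.
    // Those elements get a strictly increasing value; all others get
    // Integer.MIN_VALUE so they can never replace the running maximum.
    static int[] data(int size, int percent, long seed) {
        Random random = new Random(seed);
        int[] a = new int[size];
        int high = 0;
        for (int i = 0; i < size; i++) {
            a[i] = (random.nextInt(100) < percent) ? ++high : Integer.MIN_VALUE;
        }
        return a;
    }

    // Reduction written with an explicit branch.
    static int maxWithBranch(int[] a) {
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > max) {
                max = a[i];
            }
        }
        return max;
    }

    // Same reduction written with Math.max(int, int).
    static int maxWithMathMax(int[] a) {
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, a[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        for (int percent : new int[] {50, 80, 100}) {
            int[] a = data(10_000, percent, 42L);
            System.out.println(percent + "%: branch=" + maxWithBranch(a)
                    + " mathMax=" + maxWithMathMax(a));
        }
    }
}
```

Both methods return the same value for any input, so the sketch only helps to reason about the loop shape. Comparing the generated code would still need something like the disassembly shown above.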

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663665858