RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Galder Zamarreño
galder at openjdk.org
Mon Feb 17 17:05:28 UTC 2025
On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <galder at openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation of these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>> return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>> 1 1 0 0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
> * Remove platform dependant check
> and use platform independent configuration instead.
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ... and 34 more: https://git.openjdk.org/jdk/compare/ba549afe...a190ae68
Another interesting comparison arises when looking at `test2` at 80% vs 100%:
test2 (100%):
;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09
0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d
;*iaload {reexecute=0 rethrow=0 return_oop=0}
; - TestIntMax::test2 at 14 (line 71)
0x00007f15bcada2ee: cmpl %r11d, %r10d
0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
; - TestIntMax::test2 at 25 (line 71)
test2 (80%):
;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09
0x00007fe850ada2f0: movl 0x4c(%rsi, %rdx, 4), %r11d
0x00007fe850ada2f5: movl %r11d, (%rsp)
0x00007fe850ada2f9: movl 0x48(%rsi, %rdx, 4), %r10d
0x00007fe850ada2fe: movl %r10d, 4(%rsp)
0x00007fe850ada303: movl 0x10(%rsi, %rdx, 4), %r11d
0x00007fe850ada308: movl 0x14(%rsi, %rdx, 4), %r9d
0x00007fe850ada30d: movl 0x44(%rsi, %rdx, 4), %r10d
0x00007fe850ada312: movl %r10d, 8(%rsp)
0x00007fe850ada317: movl 0x18(%rsi, %rdx, 4), %r8d
0x00007fe850ada31c: cmpl %r11d, %eax
0x00007fe850ada31f: cmovll %r11d, %eax
0x00007fe850ada323: cmpl %r9d, %eax
0x00007fe850ada326: cmovll %r9d, %eax
0x00007fe850ada32a: movl 0x20(%rsi, %rdx, 4), %r10d
0x00007fe850ada32f: cmpl %r8d, %eax
0x00007fe850ada332: cmovll %r8d, %eax
0x00007fe850ada336: movl 0x24(%rsi, %rdx, 4), %r8d
0x00007fe850ada33b: movl 0x28(%rsi, %rdx, 4), %r11d
; {no_reloc}
0x00007fe850ada340: movl 0x2c(%rsi, %rdx, 4), %ecx
0x00007fe850ada344: movl 0x30(%rsi, %rdx, 4), %r9d
0x00007fe850ada349: movl 0x34(%rsi, %rdx, 4), %edi
0x00007fe850ada34d: movl 0x38(%rsi, %rdx, 4), %ebx
0x00007fe850ada351: movl 0x3c(%rsi, %rdx, 4), %ebp
0x00007fe850ada355: movl 0x40(%rsi, %rdx, 4), %r13d
0x00007fe850ada35a: movl 0x1c(%rsi, %rdx, 4), %r14d
0x00007fe850ada35f: cmpl %r14d, %eax
0x00007fe850ada362: cmovll %r14d, %eax
0x00007fe850ada366: cmpl %r10d, %eax
0x00007fe850ada369: cmovll %r10d, %eax
0x00007fe850ada36d: cmpl %r8d, %eax
0x00007fe850ada370: cmovll %r8d, %eax
0x00007fe850ada374: cmpl %r11d, %eax
0x00007fe850ada377: cmovll %r11d, %eax
0x00007fe850ada37b: cmpl %ecx, %eax
0x00007fe850ada37d: cmovll %ecx, %eax
0x00007fe850ada380: cmpl %r9d, %eax
0x00007fe850ada383: cmovll %r9d, %eax
0x00007fe850ada387: cmpl %edi, %eax
0x00007fe850ada389: cmovll %edi, %eax
0x00007fe850ada38c: cmpl %ebx, %eax
0x00007fe850ada38e: cmovll %ebx, %eax
0x00007fe850ada391: cmpl %ebp, %eax
0x00007fe850ada393: cmovll %ebp, %eax
0x00007fe850ada396: cmpl %r13d, %eax
0x00007fe850ada399: cmovll %r13d, %eax
0x00007fe850ada39d: cmpl 8(%rsp), %eax
0x00007fe850ada3a1: movl 8(%rsp), %r11d
0x00007fe850ada3a6: cmovll %r11d, %eax
0x00007fe850ada3aa: cmpl 4(%rsp), %eax
0x00007fe850ada3ae: movl 4(%rsp), %r10d
0x00007fe850ada3b3: cmovll %r10d, %eax
0x00007fe850ada3b7: cmpl (%rsp), %eax
0x00007fe850ada3ba: movl (%rsp), %r11d
0x00007fe850ada3be: cmovll %r11d, %eax ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
; - TestIntMax::test2 at 25 (line 71)
A couple of things are puzzling me. This test is essentially a reduction, yet no vectorization appears to kick in at any of the percentages (I've not enabled tracing of SuperWord (SW) rejections to check why). The other strange thing is the overall time. When no vectorization kicks in and the code uses cmovs, I've been seeing worse performance numbers compared to, say, compare and jump, particularly in the 100% tests. With `TestIntMax` it appears to be the opposite: test2 at 100% uses cmp+jmp, which performs worse than the cmov versions.
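For readers without the benchmark at hand, the shape of such a reduction can be sketched roughly as follows. This is not the actual `TestIntMax` source: the class name, the method names, and the `generate` helper are hypothetical, and the reading of the percentage as "the share of elements that become the new maximum" is an assumption about how the 80% and 100% inputs are built.

```java
import java.util.Random;

// Hypothetical sketch, not the real TestIntMax benchmark.
public class MaxReductionSketch {

    // Explicit compare-and-update form of the reduction.
    static int maxWithBranch(int[] data) {
        int result = Integer.MIN_VALUE;
        for (int i = 0; i < data.length; i++) {
            int v = data[i];
            if (v > result) {
                result = v;
            }
        }
        return result;
    }

    // Same reduction expressed through Math.max(int, int).
    static int maxWithMathMax(int[] data) {
        int result = Integer.MIN_VALUE;
        for (int i = 0; i < data.length; i++) {
            result = Math.max(result, data[i]);
        }
        return result;
    }

    // Assumed input shape: about `percent` of the elements are a new
    // running maximum; the rest never update the result.
    static int[] generate(int size, int percent, long seed) {
        Random random = new Random(seed);
        int[] data = new int[size];
        int max = 0;
        for (int i = 0; i < size; i++) {
            data[i] = (random.nextInt(100) < percent) ? ++max : Integer.MIN_VALUE;
        }
        return data;
    }

    public static void main(String[] args) {
        for (int percent : new int[] {80, 100}) {
            int[] data = generate(10_000, percent, 42L);
            System.out.println(percent + "%: branch=" + maxWithBranch(data)
                    + " mathMax=" + maxWithMathMax(data));
        }
    }
}
```

With an input where every element is a new maximum, the update branch is perfectly predictable, which is one plausible reason C2 might keep cmp+jmp there and pick cmov for the less predictable 80% input. Which form the JIT actually emits depends on the collected profile and is not determined by the source shape alone.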
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663665858