RFR: 8267687: ModXNode::Ideal optimization is better than Parse::do_irem

Fri Jun 4 05:32:59 UTC 2021

On Tue, 25 May 2021 15:34:44 GMT, Yi Yang <yyang at openjdk.org> wrote:

> Hi all,
> 
> Can I have a review of this change? I noticed there are two almost the same optimizations for % operation. For x%y, both Parse::do_irem and ModXNode::ideal are optimized for a special case that divisor y is `2^n` constant value. It turns out that ModXNode::Ideal optimization is better than Parse::do_irem in a simple microbenchmark(Please check out JBS attachment for more detailed benchmark result), 
> 
> ### Implementation
> - ModXNode::Ideal opt:
> https://github.com/openjdk/jdk/blob/cc687fd43ade6be8760c559f3ffa909c5937727c/src/hotspot/share/opto/divnode.cpp#L112-L160
> 
> - Parse::do_irem opt:
> https://github.com/openjdk/jdk/blob/cc687fd43ade6be8760c559f3ffa909c5937727c/src/hotspot/share/opto/parse2.cpp#L1171-L1196
> 
> ### JMH Microbenchmark
> - x64
> 
> ModXNode::Ideal opt:
> Benchmark                         Mode  Cnt     Score     Error  Units
> ModPowerOf2.testNegativePowerOf2  avgt   25  8746.608 ± 139.777  ns/op
> ModPowerOf2.testPositivePowerOf2  avgt   25  8735.545 ±  86.145  ns/op
> 
> Parse::do_irem opt:
> Benchmark                         Mode  Cnt     Score   Error  Units
> ModPowerOf2.testNegativePowerOf2  avgt   25  8693.797 ± 7.844  ns/op
> ModPowerOf2.testPositivePowerOf2  avgt   25  6618.652 ± 1.739  ns/op
> 
> 
> - AArch64
> 
> ModXNode::Ideal opt:
> Benchmark                         Mode  Cnt     Score   Error  Units
> ModPowerOf2.testNegativePowerOf2  avgt   25  6758.299 ± 4.452  ns/op
> ModPowerOf2.testPositivePowerOf2  avgt   25  6759.533 ± 2.791  ns/op
> 
> Parse::do_irem opt:
> Benchmark                         Mode  Cnt     Score   Error  Units
> ModPowerOf2.testNegativePowerOf2  avgt   25  6757.868 ± 4.044  ns/op
> ModPowerOf2.testPositivePowerOf2  avgt   25  6546.034 ± 0.794  ns/op
> 
> 
> In summary, ModXNode::Ideal is better than Parse::do_irem on x64 platform, benchmark results are also reproducible. I guess the reason is that Parse::do_irem generates cmp and jmp while ModXNode::Ideal does not. On AArch64 platform, the two optimizations are almost the same according to benchmark results. I am not familiar with AArch64 so I can not give a constructive answer.
> 
> 
> ### Assembly
> - source code
> 
> static int foo15(int k){
>     return k%8;
> }
> 
> 
> - x64
> 
> ModXNode::Ideal opt:
> 
> [Verified Entry Point]
>   # {method} {0x00007f0ded401ec8} 'foo15' '(I)I' in 'CheckIndex'
>   # parm0:    rsi       = int
>   #           [sp+0x20]  (sp of caller)
>   0x00007f0e4952e780:   sub    $0x18,%rsp
>   0x00007f0e4952e787:   mov    %rbp,0x10(%rsp)              ;*synchronization entry
>                                                             ; - CheckIndex::foo15 at -1 (line 125)
>   0x00007f0e4952e78c:   mov    %esi,%r10d
>   0x00007f0e4952e78f:   sar    $0x1f,%r10d
>   0x00007f0e4952e793:   shr    $0x1d,%r10d
>   0x00007f0e4952e797:   add    %esi,%r10d
>   0x00007f0e4952e79a:   and    $0xfffffffffffffff8,%r10d
>   0x00007f0e4952e79e:   mov    %esi,%eax
>   0x00007f0e4952e7a0:   sub    %r10d,%eax                   ;*irem {reexecute=0 rethrow=0 return_oop=0}
>                                                             ; - CheckIndex::foo15 at 3 (line 125)
>   0x00007f0e4952e7a3:   add    $0x10,%rsp
>   0x00007f0e4952e7a7:   pop    %rbp
>   0x00007f0e4952e7a8:   cmp    0x340(%r15),%rsp             ;   {poll_return}
>   0x00007f0e4952e7af:   ja     0x00007f0e4952e7b6
>   0x00007f0e4952e7b5:   retq   
>   0x00007f0e4952e7b6:   mov    $0x7f0e4952e7a8,%r10         ;   {internal_word}
>   0x00007f0e4952e7c0:   mov    %r10,0x358(%r15)
>   0x00007f0e4952e7c7:   jmpq   0x00007f0e49506100           ;   {runtime_call SafepointBlob}
>   0x00007f0e4952e7cc:   hlt  
>   						...   ; omit remaining hlt(s)
>   
> 
> 
> Parse::do_irem opt:
> 
> [Verified Entry Point]
>   # {method} {0x00007fbab9c01ec8} 'foo15' '(I)I' in 'CheckIndex'
>   # parm0:    rsi       = int
>   #           [sp+0x20]  (sp of caller)
>   0x00007fbb1552db00:   sub    $0x18,%rsp
>   0x00007fbb1552db07:   mov    %rbp,0x10(%rsp)              ;*synchronization entry
>                                                             ; - CheckIndex::foo15 at -1 (line 125)
>   0x00007fbb1552db0c:   test   %esi,%esi
>   0x00007fbb1552db0e:   jge    0x00007fbb1552db2c
>   0x00007fbb1552db10:   neg    %esi
>   0x00007fbb1552db12:   and    $0x7,%esi
>   0x00007fbb1552db15:   mov    %esi,%eax
>   0x00007fbb1552db17:   neg    %eax
>   0x00007fbb1552db19:   add    $0x10,%rsp
>   0x00007fbb1552db1d:   pop    %rbp
>   0x00007fbb1552db1e:   cmp    0x340(%r15),%rsp             ;   {poll_return}
>   0x00007fbb1552db25:   ja     0x00007fbb1552db33
>   0x00007fbb1552db2b:   retq   
>   0x00007fbb1552db2c:   mov    %esi,%eax
>   0x00007fbb1552db2e:   and    $0x7,%eax                    ;*irem {reexecute=0 rethrow=0 return_oop=0}
>                                                             ; - CheckIndex::foo15 at 3 (line 125)
>   0x00007fbb1552db31:   jmp    0x00007fbb1552db19
>   0x00007fbb1552db33:   mov    $0x7fbb1552db1e,%r10         ;   {internal_word}
>   0x00007fbb1552db3d:   mov    %r10,0x358(%r15)
>   0x00007fbb1552db44:   jmpq   0x00007fbb15505100           ;   {runtime_call SafepointBlob}
>   0x00007fbb1552db49:   hlt    
> 						...   ; omit remaining hlt(s)   
> 
> 
> - AArch64
> 
> ModXNode::Ideal opt:
> 
> [Verified Entry Point]
>   # {method} {0x0000ffff3d000218} 'foo15' '(I)I' in 'CheckIndex'
>   # parm0:    c_rarg1   = int
>   #           [sp+0x20]  (sp of caller)
>   0x0000ffff9d4f7580:   nop
>   0x0000ffff9d4f7584:   sub	sp, sp, #0x20
>   0x0000ffff9d4f7588:   stp	x29, x30, [sp, #16]         ;*synchronization entry
>                                                             ; - CheckIndex::foo15 at -1 (line 3)
>   0x0000ffff9d4f758c:   asr	w10, w1, #31
>   0x0000ffff9d4f7590:   add	w11, w1, w10, lsr #29
>   0x0000ffff9d4f7594:   and	w10, w11, #0xfffffff8
>   0x0000ffff9d4f7598:   sub	w0, w1, w10                 ;*irem {reexecute=0 rethrow=0 return_oop=0}
>                                                             ; - CheckIndex::foo15 at 3 (line 3)
>   0x0000ffff9d4f759c:   ldp	x29, x30, [sp, #16]
>   0x0000ffff9d4f75a0:   add	sp, sp, #0x20
>   0x0000ffff9d4f75a4:   ldr	x8, [x28, #832]             ;   {poll_return}
>   0x0000ffff9d4f75a8:   cmp	sp, x8
>   0x0000ffff9d4f75ac:   b.hi	0x0000ffff9d4f75b4  // b.pmore
>   0x0000ffff9d4f75b0:   ret
>   0x0000ffff9d4f75b4:   adr	x8, 0x0000ffff9d4f75a4      ;   {internal_word}
>   0x0000ffff9d4f75b8:   str	x8, [x28, #856]
>   0x0000ffff9d4f75bc:   b	0x0000ffff9d4cd600          ;   {runtime_call SafepointBlob}
> 
> 
> Parse::do_irem opt:
> 
> [Verified Entry Point]
>   # {method} {0x0000ffff1c800218} 'foo15' '(I)I' in 'CheckIndex'
>   # parm0:    c_rarg1   = int
>   #           [sp+0x20]  (sp of caller)
>   0x0000ffff7d4f7580:   nop
>   0x0000ffff7d4f7584:   sub	sp, sp, #0x20
>   0x0000ffff7d4f7588:   stp	x29, x30, [sp, #16]         ;*synchronization entry
>                                                             ; - CheckIndex::foo15 at -1 (line 3)
>   0x0000ffff7d4f758c:   tbz	w1, #31, 0x0000ffff7d4f75b4
>   0x0000ffff7d4f7590:   neg	w10, w1
>   0x0000ffff7d4f7594:   and	w11, w10, #0x7
>   0x0000ffff7d4f7598:   neg	w0, w11
>   0x0000ffff7d4f759c:   ldp	x29, x30, [sp, #16]
>   0x0000ffff7d4f75a0:   add	sp, sp, #0x20
>   0x0000ffff7d4f75a4:   ldr	x8, [x28, #832]             ;   {poll_return}
>   0x0000ffff7d4f75a8:   cmp	sp, x8
>   0x0000ffff7d4f75ac:   b.hi	0x0000ffff7d4f75bc  // b.pmore
>   0x0000ffff7d4f75b0:   ret
>   0x0000ffff7d4f75b4:   and	w0, w1, #0x7                ;*irem {reexecute=0 rethrow=0 return_oop=0}
>                                                             ; - CheckIndex::foo15 at 3 (line 3)
>   0x0000ffff7d4f75b8:   b	0x0000ffff7d4f759c
>   0x0000ffff7d4f75bc:   adr	x8, 0x0000ffff7d4f75a4      ;   {internal_word}
>   0x0000ffff7d4f75c0:   str	x8, [x28, #856]
>   0x0000ffff7d4f75c4:   b	0x0000ffff7d4cd600          ;   {runtime_call SafepointBlob}
>   0x0000ffff7d4f75c8:   .inst	0x00000000 ; undefined
>                         ...   ; omit remaining .inst(s)
> 
> 
> ### Ideal graph:
> ----------------
> ![ideal_graph](https://user-images.githubusercontent.com/5010047/119525589-34585f80-bdb1-11eb-9d7e-e3962cd7f789.jpg)
> (Parse opt(left) vs Ideal opt(right))
> 
> Thanks!
> Yang

Hi Vladimir,

> I have question about benchmarks results. The unit is `ns/op` - the less number is better. And `Parse::do_irem` gives smaller numbers. Why then `Ideal` is better?

Thanks for your reminder, I overlook the unit `ns/op`.

> I understand that logically code without branch should be faster. But if the branch is predicted by CPU it may be not the case.
> I am for to have only one place to do the optimization. But I want to see correct numbers which we can based on.
> Also would be nice to see numbers for mixed (negative and positive) loop.

It seems test is indeed affected by the branch prediction. When I add mixed computations loop, ModINode::Ideal is better than Parse::do_irem

x64
-----
Parse::do_irem
Benchmark                         Mode  Cnt      Score    Error  Units
ModPowerOf2.testMixedPowerOf2     avgt   25  13258.945 ± 12.420  ns/op
ModPowerOf2.testNegativePowerOf2  avgt   25   8685.782 ± 16.010  ns/op
ModPowerOf2.testPositivePowerOf2  avgt   25   6604.848 ±  1.402  ns/op

ModINode::Ideal
Benchmark                         Mode  Cnt      Score    Error  Units
ModPowerOf2.testMixedPowerOf2     avgt   25  12918.206 ± 11.730  ns/op
ModPowerOf2.testNegativePowerOf2  avgt   25   8675.248 ±  3.279  ns/op
ModPowerOf2.testPositivePowerOf2  avgt   25   8678.828 ±  6.336  ns/op

AArch64:
------
Parse::do_irem
Benchmark                         Mode  Cnt      Score   Error  Units
ModPowerOf2.testMixedPowerOf2     avgt   25  12340.268 ± 1.015  ns/op
ModPowerOf2.testNegativePowerOf2  avgt   25   6752.467 ± 2.671  ns/op
ModPowerOf2.testPositivePowerOf2  avgt   25   6545.322 ± 0.666  ns/op

ModINode::Ideal
Benchmark                         Mode  Cnt      Score   Error  Units
ModPowerOf2.testMixedPowerOf2     avgt   25  12339.301 ± 1.406  ns/op
ModPowerOf2.testNegativePowerOf2  avgt   25   6753.852 ± 3.460  ns/op
ModPowerOf2.testPositivePowerOf2  avgt   25   6752.895 ± 4.207  ns/op

So I must correct my previous conclusion. On aarch64, there is no obvious difference between these two optimizations. On x64, Parse::do_irem is better than ModINode::Ideal w/ branch prediction, otherwise ModINode::Ideal is a better choice.
[new_bench_aarch64.log](https://github.com/openjdk/jdk/files/6595966/new_bench_aarch64.log)
[new_bench_x64.log](https://github.com/openjdk/jdk/files/6595967/new_bench_x64.log)

-------------

PR: https://git.openjdk.java.net/jdk/pull/4188