RFR: 8282204: Use lea instructions for arithmetic operations on x86_64 [v2]

Fri Feb 25 07:00:57 UTC 2022

On Thu, 24 Feb 2022 12:10:09 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

>>> @vnkozlov Given `lea` is a really efficient instruction, merging multiple ones into it offers a lot of benefits and all other compilers do so.
>> 
>> So any benchmark to show the perf improvement?
>
> @DamonFool I found benchmarking these single arithmetic-instruction optimizations is hard, especially these rules which contain constant immediates. This is the result from a benchmark I wrote, my machine doesn't match 3-operand rules so the result of `B_I_D_*` is the same, and `B_IS_D_int` seems to suffer from loop alignment which leads to decoder bottleneck trying to read a lot of `nop`.
> Thank you very much.
> 
>     Before:
>     Benchmark                   Mode  Cnt     Score     Error  Units
>     LeaInstruction.B_IS_D_int   avgt   10  1171.089 ±  12.051  ns/op
>     LeaInstruction.B_IS_D_long  avgt   10  1214.248 ± 164.069  ns/op
>     LeaInstruction.B_IS_int     avgt   10   908.979 ±  57.721  ns/op
>     LeaInstruction.B_IS_long    avgt   10  1218.707 ±   2.169  ns/op
>     LeaInstruction.B_I_D_int    avgt   10   842.187 ±  65.795  ns/op
>     LeaInstruction.B_I_D_long   avgt   10  1289.333 ±   9.978  ns/op
>     LeaInstruction.IS_D_int     avgt   10   533.597 ±   1.302  ns/op
>     LeaInstruction.IS_D_long    avgt   10   533.198 ±   0.559  ns/op
> 
>     After:
>     Benchmark                   Mode  Cnt     Score    Error  Units
>     LeaInstruction.B_IS_D_int   avgt   10  1217.740 ±  4.110  ns/op
>     LeaInstruction.B_IS_D_long  avgt   10   809.962 ±  8.156  ns/op
>     LeaInstruction.B_IS_int     avgt   10   536.518 ±  5.076  ns/op
>     LeaInstruction.B_IS_long    avgt   10   534.041 ±  1.158  ns/op
>     LeaInstruction.B_I_D_int    avgt   10   808.131 ±  0.965  ns/op
>     LeaInstruction.B_I_D_long   avgt   10  1287.391 ± 10.025  ns/op
>     LeaInstruction.IS_D_int     avgt   10   305.940 ±  9.886  ns/op
>     LeaInstruction.IS_D_long    avgt   10   308.969 ±  0.844  ns/op

Thanks, @merykitty, for the benchmark.
I also see the performance improvement with your microbenchmark.
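
For anyone following the benchmark names, here is a rough sketch of the expression shapes I take them to mean. It assumes that B, I, S and D stand for base, index, scale and displacement; the class name, method names, shift amount and displacement value are all hypothetical and are not taken from the actual benchmark source.

```java
public class LeaShapes {
    // Hypothetical constants, chosen only for illustration.
    static final int SCALE_SHIFT = 2;   // scale factor 4
    static final int DISP = 16;         // constant displacement

    // B_IS: base + index * scale
    static int bIS(int base, int index) {
        return base + (index << SCALE_SHIFT);
    }

    // IS_D: index * scale + displacement
    static int isD(int index) {
        return (index << SCALE_SHIFT) + DISP;
    }

    // B_I_D: base + index + displacement (the 3-operand form without scaling)
    static int bID(int base, int index) {
        return base + index + DISP;
    }

    // B_IS_D: base + index * scale + displacement
    static int bISD(int base, int index) {
        return base + (index << SCALE_SHIFT) + DISP;
    }

    public static void main(String[] args) {
        System.out.println(bIS(10, 3));
        System.out.println(isD(3));
        System.out.println(bID(10, 3));
        System.out.println(bISD(10, 3));
    }
}
```

Whether the JIT emits a single `lea` for any of these shapes depends on the matcher rules and the CPU, so this only shows the source-level patterns.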

I read the "Intel® 64 and IA-32 Architectures Optimization Reference Manual": https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

It contains `Assembly/Compiler Coding Rule 34`:

Assembly/Compiler Coding Rule 34. (ML impact, L generality) If an LEA instruction using the
scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth
out of the trace cache are the critical factor, then use the LEA instruction. 
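
To make the trade-off in the rule concrete, here is a minimal source-level sketch of the two alternatives it describes: a scaled-index expression that a compiler may turn into a single `lea`, and the same value built from a chain of additions. The class and method names are hypothetical, and which instructions are actually emitted is up to the JIT.

```java
public class ScaledIndexSketch {
    // Candidate for one LEA with a scaled index: base + index * 4.
    static long viaScaledIndex(long base, long index) {
        return base + (index << 2);
    }

    // The same value written as a sequence of additions,
    // the alternative the rule suggests for the critical path.
    static long viaAdds(long base, long index) {
        long twice = index + index;   // index * 2
        long quad = twice + twice;    // index * 4
        return base + quad;
    }

    public static void main(String[] args) {
        // Both forms must agree for every input, including negative indices.
        for (long i = -4; i <= 4; i++) {
            if (viaScaledIndex(100, i) != viaAdds(100, i)) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("both forms agree");
    }
}
```

The two methods are arithmetically equivalent; the rule is only about which instruction sequence is preferable in a given context.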

Also, please note:
![image](https://user-images.githubusercontent.com/19923746/155669026-28d11a2e-cf4c-4fd0-8d36-cd10b703107d.png)

So what do you think of these descriptions?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7560