RFR: 8282204: Use lea instructions for arithmetic operations on x86_64 [v2]
Jie Fu
jiefu at openjdk.java.net
Fri Feb 25 07:00:57 UTC 2022
On Thu, 24 Feb 2022 12:10:09 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:
>>> @vnkozlov Given `lea` is a really efficient instruction, merging multiple ones into it offers a lot of benefits and all other compilers do so.
>>
>> So any benchmark to show the perf improvement?
>
> @DamonFool I found that benchmarking these single arithmetic-instruction optimizations is hard, especially for the rules that contain constant immediates. Below is the result from a benchmark I wrote. The 3-operand rules do not match on my machine, so the `B_I_D_*` results are unchanged, and `B_IS_D_int` seems to suffer from loop alignment, which leads to a decoder bottleneck from reading a lot of `nop`s.
> Thank you very much.
>
> Before:
> Benchmark                    Mode  Cnt     Score     Error  Units
> LeaInstruction.B_IS_D_int    avgt   10  1171.089 ±  12.051  ns/op
> LeaInstruction.B_IS_D_long   avgt   10  1214.248 ± 164.069  ns/op
> LeaInstruction.B_IS_int      avgt   10   908.979 ±  57.721  ns/op
> LeaInstruction.B_IS_long     avgt   10  1218.707 ±   2.169  ns/op
> LeaInstruction.B_I_D_int     avgt   10   842.187 ±  65.795  ns/op
> LeaInstruction.B_I_D_long    avgt   10  1289.333 ±   9.978  ns/op
> LeaInstruction.IS_D_int      avgt   10   533.597 ±   1.302  ns/op
> LeaInstruction.IS_D_long     avgt   10   533.198 ±   0.559  ns/op
>
> After:
> Benchmark                    Mode  Cnt     Score     Error  Units
> LeaInstruction.B_IS_D_int    avgt   10  1217.740 ±   4.110  ns/op
> LeaInstruction.B_IS_D_long   avgt   10   809.962 ±   8.156  ns/op
> LeaInstruction.B_IS_int      avgt   10   536.518 ±   5.076  ns/op
> LeaInstruction.B_IS_long     avgt   10   534.041 ±   1.158  ns/op
> LeaInstruction.B_I_D_int     avgt   10   808.131 ±   0.965  ns/op
> LeaInstruction.B_I_D_long    avgt   10  1287.391 ±  10.025  ns/op
> LeaInstruction.IS_D_int      avgt   10   305.940 ±   9.886  ns/op
> LeaInstruction.IS_D_long     avgt   10   308.969 ±   0.844  ns/op
Thanks @merykitty for your benchmark.
I also saw the perf improvement with your micro bench.
I read the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" : https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
There is `Assembly/Compiler Coding Rule 34`.
Assembly/Compiler Coding Rule 34. (ML impact, L generality) If an LEA instruction using the
scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth
out of the trace cache are the critical factor, then use the LEA instruction.
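To make the trade-off in Rule 34 concrete, here is a rough Java sketch (not taken from the PR or the benchmark) of the kind of expression involved. It assumes the benchmark names abbreviate base (B), scaled index (IS), and displacement (D); the class name `LeaShapes`, the method names, and the constant `DISP` are hypothetical. The first method has the shape that could be folded into a single `lea` with a scaled index, the second computes the same value with additions only. Whether the JIT actually emits `lea` or `add` for either one depends on the compiler and the CPU, so this only shows that the two forms are arithmetically equivalent.

```java
// Hypothetical illustration; none of these names come from the PR.
final class LeaShapes {
    // Arbitrary constant displacement, chosen only for the example.
    static final int DISP = 7;

    // base + (index << 2) + disp: the "B_IS_D" shape, which fits the
    // x86_64 addressing form [base + index*4 + disp] of one lea.
    static int baseIndexScaleDisp(int base, int index) {
        return base + (index << 2) + DISP;
    }

    // The same value built from additions only, the alternative that
    // Rule 34 mentions for a scaled index on the critical path.
    static int addsOnly(int base, int index) {
        int t = index + index; // index * 2
        t = t + t;             // index * 4
        t = t + base;          // + base
        return t + DISP;       // + displacement
    }

    public static void main(String[] args) {
        System.out.println(baseIndexScaleDisp(10, 3));
        System.out.println(addsOnly(10, 3));
    }
}
```

Both methods wrap around identically on `int` overflow, so they agree for all inputs; the difference the rule is about is only latency versus code density.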
Also please note :

So what do you think of these descriptions?
-------------
PR: https://git.openjdk.java.net/jdk/pull/7560
More information about the hotspot-compiler-dev
mailing list