RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition
SuperCoder79
duke at openjdk.org
Sat Jul 30 03:04:16 UTC 2022
On Tue, 26 Jul 2022 15:39:42 GMT, SuperCoder79 <duke at openjdk.org> wrote:
> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf), many older microarchitectures, such as Sandy Bridge and Ivy Bridge, have different latencies for addition and multiplication, meaning this change could be beneficial in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the C2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and others where it used `phase->type(value)`. Similarly, can nodes be reused, as is done in the AddNode constructors? I saw some places where the `clone` method was used, but others where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine
Hi, thank you for your assistance with this. I have updated the PR title and applied the changes from code review. I have also updated the benchmark and attached the results below. I tested the benchmark on two systems, one new and one old. The new system has a Ryzen 5 4500U CPU, with these results:
Benchmark                Mode  Cnt  Baseline (ns/op)   Patch (ns/op)
TestMul2.testMul2Double  avgt   10  209.740 ±  1.454   209.315 ± 1.116  (+0.20%)
TestMul2.testMul2Float   avgt   10  210.871 ±  6.179   209.498 ± 0.777  (+0.65%)
The benchmark showed very little change on the new system, which is expected, as the instruction tables list both `vaddsd` and `vmulsd` at a latency of 3 cycles and a reciprocal throughput of 0.5 on this architecture. The slight gain could come from the elimination of the memory reference, or simply from testing variance. The older system ran a Xeon X5690, with these results:
Benchmark                Mode  Cnt  Baseline (ns/op)   Patch (ns/op)
TestMul2.testMul2Double  avgt   10  190.062 ±  9.695   170.393 ± 1.193  (+10.34%)
TestMul2.testMul2Float   avgt   10  184.239 ±  1.983   171.329 ± 4.261  (+7.00%)
Because the older system has faster addition than multiplication, especially for double-precision operations, the gains there were far more substantial.
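As an aside, what makes this rewrite safe is that `x * 2` and `x + x` produce identical results for every IEEE-754 input, including ±0, infinities, NaN, and subnormals. A small standalone check (hypothetical, not part of the patch) that exercises the edge cases:

```java
public class MulByTwo {
    // Returns true iff x * 2.0 and x + x produce the same double value.
    // doubleToLongBits canonicalizes NaN, so all NaN results compare equal.
    static boolean sameResult(double x) {
        return Double.doubleToLongBits(x * 2.0) == Double.doubleToLongBits(x + x);
    }

    public static void main(String[] args) {
        double[] edgeCases = {
            0.0, -0.0, 1.5, -3.25,
            Double.MIN_VALUE,      // smallest subnormal: doubling is exact
            Double.MIN_NORMAL,
            Double.MAX_VALUE,      // both forms overflow to infinity
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY,
            Double.NaN
        };
        for (double x : edgeCases) {
            if (!sameResult(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("x * 2 == x + x for all tested inputs");
    }
}
```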
-------------
PR: https://git.openjdk.org/jdk/pull/9642