RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition
SuperCoder79
duke at openjdk.org
Sat Jul 30 03:04:16 UTC 2022
On Tue, 26 Jul 2022 15:39:42 GMT, SuperCoder79 <duke at openjdk.org> wrote:
> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf), many older microarchitectures, such as Sandy Bridge and Ivy Bridge, have different latencies for addition and multiplication, meaning this change could be beneficial in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the C2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and others where it used `phase->type(value)`. Similarly, can nodes be reused, as is done in the AddNode constructors? I saw some places where the `clone` method was used, but others where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine
Hi, thank you for your assistance with this. I have updated the PR title and applied the changes from code review. I have also updated the benchmark and attached the results below. I tested the benchmark on two systems, one new and one old. The new system has a Ryzen 5 4500U CPU, with these results:
Benchmark                Mode  Cnt  Baseline (ns/op)   Patch (ns/op)
TestMul2.testMul2Double  avgt   10  209.740 ±  1.454   209.315 ± 1.116  (+0.20%)
TestMul2.testMul2Float   avgt   10  210.871 ±  6.179   209.498 ± 0.777  (+0.65%)
The benchmark showed very little change on the new system, which is expected, as the instruction tables list both `vaddsd` and `vmulsd` at a latency of 3 cycles and a reciprocal throughput of 0.5 on this architecture. The slight gain could come from the elimination of the memory reference, or simply from testing variance. The older system ran a Xeon X5690, with these results:
Benchmark                Mode  Cnt  Baseline (ns/op)   Patch (ns/op)
TestMul2.testMul2Double  avgt   10  190.062 ±  9.695   170.393 ± 1.193  (+10.34%)
TestMul2.testMul2Float   avgt   10  184.239 ±  1.983   171.329 ± 4.261  (+7.00%)
Because the older system has faster addition than multiplication, especially for double-precision operations, the gains there were far more substantial.
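As an aside, what makes this rewrite safe is that `x * 2` and `x + x` produce identical results for every IEEE-754 input, including ±0, infinities, NaN, and subnormals. A small standalone check (hypothetical, not part of the patch) that exercises the edge cases:

```java
public class MulByTwo {
    // Returns true iff x * 2.0 and x + x produce the same double value.
    // doubleToLongBits canonicalizes NaN, so all NaN results compare equal.
    static boolean sameResult(double x) {
        return Double.doubleToLongBits(x * 2.0) == Double.doubleToLongBits(x + x);
    }

    public static void main(String[] args) {
        double[] edgeCases = {
            0.0, -0.0, 1.5, -3.25,
            Double.MIN_VALUE,      // smallest subnormal: doubling is exact
            Double.MIN_NORMAL,
            Double.MAX_VALUE,      // both forms overflow to infinity
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY,
            Double.NaN
        };
        for (double x : edgeCases) {
            if (!sameResult(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("x * 2 == x + x for all tested inputs");
    }
}
```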
-------------
PR: https://git.openjdk.org/jdk/pull/9642