RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions

Mon Dec 15 17:27:12 UTC 2025

Emulate multiplier using LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details reagarding slow vs fast lea instructions.
Given that latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs each with 1 cycle latency to emulate multipler is performant.

Consider X as the multiplicand, by variying the scale of  first LEA instruction we can generate 4 input i.e.

    X + X * 1 = 2X
    X + X * 2 = 3X
    X + X * 4 = 5X
    X + X * 8 = 9X

Following table list downs various multiplier combinations for output of first LEA at BASE and/or INDEX by varying the 
scale of second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.

      BASE   INDEX   SCALE  MULTIPLER
        X      X       1       2       (Terminal)
        X      X       2       3       (Terminal)
        X      X       4       5       (Terminal)
        X      X       8       9       (Terminal)
       3X     3X       1       6
        X     3X       2       7
       5X     5X       1      10
        X     5X       2      11
        X     3X       4      13
       5X     5X       2      15
        X     2X       8      17
       9X     9X       1      18
        X     9X       2      19
        X     5X       4      21
       5X     5X       4      25
       9X     9X       2      27
        X     9X       4      37
        X     5X       8      41
       9X     9X       4      45
        X     9X       8      73
       9X     9X       8      81

All the non-unity inputs tied to BASE / INDEX  are derived out of terminal cases which represent first FAST LEA. Thus, all the multipliers can be computed using just two LEA instructions.

Performance numbers for new micro benchmark included with this patch shows around **5-50% improvments** on latest x86 servers.

System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
Baseline:-
Benchmark                                             Mode  Cnt    Score   Error    Units
ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  189.690          ops/min
ConstantMultiplierOptimization.testConstMultiplierL  thrpt    2  196.283          ops/min

Withopt:-
Benchmark                                             Mode  Cnt    Score   Error    Units
ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  283.827          ops/min
ConstantMultiplierOptimization.testConstMultiplierL  thrpt    2  283.578          ops/min

System: AMD EPYC 9B45 Turin:-
Baseline:-
Benchmark                                             Mode  Cnt    Score   Error    Units
ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  393.299          ops/min
ConstantMultiplierOptimization.testConstMultiplierL  thrpt    2  393.764          ops/min

Withopt:-
Benchmark                                             Mode  Cnt    Score   Error    Units
ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  409.790          ops/min
ConstantMultiplierOptimization.testConstMultiplierL  thrpt    2  409.756          ops/min

Effect of optimization is more pronounced on Intel server in comparison to AMD's,  As per Agner Fogs' instruction latency manual IMUL instruciton has resiprocal througput of 1 while Fast LEA has reciprocal througput of 0.25 on Zen4 and around 0.5 on Intel Sunnycove (Icelake), going by that we can executue more LEA in parallel in one cycle, it will need to be investigated seperately why we don't see similar gains on AMD target.

Kindly review and share your feedback.

Best Regards,
Jatin

-------------

Commit messages:
 - Adding IR framework tests
 - Adding benchmark
 - 8373480: Optimize constant input multiplication using LEA instructions

Changes: https://git.openjdk.org/jdk/pull/28759/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8373480
  Stats: 535 lines in 8 files changed: 525 ins; 0 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/28759.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28759/head:pull/28759

PR: https://git.openjdk.org/jdk/pull/28759