RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions [v4]

Tue Dec 16 11:56:40 UTC 2025

> Emulate multiplier using LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
> Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details reagarding slow vs fast lea instructions.
> Given that latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs each with 1 cycle latency to emulate multipler is performant.
> 
> Consider X as the multiplicand, by variying the scale of  first LEA instruction we can generate 4 input i.e.
> 
> 
>     X + X * 1 = 2X
>     X + X * 2 = 3X
>     X + X * 4 = 5X
>     X + X * 8 = 9X
> 
> 
> Following table list downs various multiplier combinations for output of first LEA at BASE and/or INDEX by varying the 
> scale of second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.
> 
> 
>       BASE   INDEX   SCALE  MULTIPLER
>         X      X       1       2       (Terminal)
>         X      X       2       3       (Terminal)
>         X      X       4       5       (Terminal)
>         X      X       8       9       (Terminal)
>        3X     3X       1       6
>         X     3X       2       7
>        5X     5X       1      10
>         X     5X       2      11
>         X     3X       4      13
>        5X     5X       2      15
>         X     2X       8      17
>        9X     9X       1      18
>         X     9X       2      19
>         X     5X       4      21
>        5X     5X       4      25
>        9X     9X       2      27
>         X     9X       4      37
>         X     5X       8      41
>        9X     9X       4      45
>         X     9X       8      73
>        9X     9X       8      81
> 
> 
> All the non-unity inputs tied to BASE / INDEX  are derived out of terminal cases which represent first FAST LEA. Thus, all the multipliers can be computed using just two LEA instructions.
> 
> Performance numbers for new micro benchmark included with this patch shows around **5-50% improvments** on latest x86 servers.
> 
> 
> System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
> Baseline:-
> Benchmark                                             Mode  Cnt    Score   Error    Units
> ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  189.690          ops/min
> ConstantMultiplierOptimization.testConstMultiplierL  thrpt    2  196.283          ops/min
> 
> 
> Withopt:-
> Benchmark                                             Mode  Cnt    Score   Error    Units
> ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  283.827          ops/min
> ConstantMultiplierOptimization...

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Minor cleanup in Template-Framework test

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/28759/files
  - new: https://git.openjdk.org/jdk/pull/28759/files/d792c49b..66a28502

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=02-03

  Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/28759.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28759/head:pull/28759

PR: https://git.openjdk.org/jdk/pull/28759