RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions [v3]

Tue Dec 16 11:47:33 UTC 2025

> Emulate multiplication by a constant using the LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
> Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details regarding slow vs fast LEA instructions.
> Given that the latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs, each with 1 cycle latency, emulates the multiplication with lower latency.
> 
> Consider X as the multiplicand. By varying the scale of the first LEA instruction we can generate 4 inputs, i.e.
> 
> 
>     X + X * 1 = 2X
>     X + X * 2 = 3X
>     X + X * 4 = 5X
>     X + X * 8 = 9X
> 
> 
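The four terminal cases above follow directly from the LEA address formula. A minimal Java sketch, assuming a hypothetical `lea` helper that only models the arithmetic (it is not code from the patch):

```java
public class LeaTerminalCases {
    // Models the effective-address arithmetic of LEA: base + index * scale + offset.
    // Hypothetical helper for illustration only; it is not code from the patch.
    static long lea(long base, long index, int scale, long offset) {
        if (scale != 1 && scale != 2 && scale != 4 && scale != 8) {
            throw new IllegalArgumentException("x86 scale must be 1, 2, 4 or 8");
        }
        return base + index * scale + offset;
    }

    public static void main(String[] args) {
        long x = 7; // arbitrary sample multiplicand
        for (int scale : new int[] {1, 2, 4, 8}) {
            long r = lea(x, x, scale, 0);
            // Prints the multiple of X produced by BASE = INDEX = X at this scale.
            System.out.println("X + X * " + scale + " = " + (r / x) + "X");
        }
    }
}
```

With BASE = INDEX = X and no offset, a single LEA can therefore only produce 2X, 3X, 5X or 9X.
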
> The following table lists the multiplier combinations obtained by placing the output of the first LEA at BASE and/or INDEX and varying the
> scale of the second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.
> 
> 
>       BASE   INDEX   SCALE  MULTIPLIER
>         X      X       1       2       (Terminal)
>         X      X       2       3       (Terminal)
>         X      X       4       5       (Terminal)
>         X      X       8       9       (Terminal)
>        3X     3X       1       6
>         X     3X       2       7
>        5X     5X       1      10
>         X     5X       2      11
>         X     3X       4      13
>        5X     5X       2      15
>         X     2X       8      17
>        9X     9X       1      18
>         X     9X       2      19
>         X     5X       4      21
>        5X     5X       4      25
>        9X     9X       2      27
>         X     9X       4      37
>         X     5X       8      41
>        9X     9X       4      45
>         X     9X       8      73
>        9X     9X       8      81
> 
> 
> All the non-unity inputs tied to BASE / INDEX are derived from the terminal cases, which represent the first fast LEA. Thus, all the multipliers can be computed using just two LEA instructions.
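The two-LEA composition described above can be checked with a small brute-force sketch. This is an illustration of the arithmetic only, not the C2 implementation; `lea`, `mulTwoLea` and `findRecipe` are hypothetical names, and the recipe found for a multiplier may differ from the table row when several combinations give the same result.

```java
public class TwoLeaMultiply {
    static final int[] SCALES = {1, 2, 4, 8};

    // LEA effective-address arithmetic with zero offset: base + index * scale.
    static long lea(long base, long index, int scale) {
        return base + index * scale;
    }

    // First LEA builds a terminal value t = X + X * firstScale (2X, 3X, 5X or 9X).
    // Second LEA uses t as INDEX, and either X or t as BASE.
    static long mulTwoLea(long x, int firstScale, boolean baseIsX, int secondScale) {
        long t = lea(x, x, firstScale);
        long base = baseIsX ? x : t;
        return lea(base, t, secondScale);
    }

    // Brute-force search over all combinations; returns null if none yields m.
    static String findRecipe(int m) {
        for (int s1 : SCALES) {
            for (boolean baseIsX : new boolean[] {true, false}) {
                for (int s2 : SCALES) {
                    if (mulTwoLea(1, s1, baseIsX, s2) == m) {
                        int t = 1 + s1;
                        return (baseIsX ? "X" : t + "X") + " + " + t + "X * " + s2;
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Multipliers listed in the non-terminal rows of the table.
        int[] table = {6, 7, 10, 11, 13, 15, 17, 18, 19, 21, 25, 27, 37, 41, 45, 73, 81};
        for (int m : table) {
            System.out.println(m + "X = " + findRecipe(m));
        }
    }
}
```

Every multiplier in the table has at least one such recipe, while a value like 23 has none and would still need IMUL or another sequence.
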
> 
> Performance numbers for the new micro benchmark included with this patch show around **5-50% improvements** on the latest x86 servers.
> 
> 
> System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
> Baseline:-
> Benchmark                                             Mode  Cnt    Score   Error    Units
> ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  189.690          ops/min
> ConstantMultiplierOptimization.testConstMultiplierL  thrpt    2  196.283          ops/min
> 
> 
> With optimization:-
> Benchmark                                             Mode  Cnt    Score   Error    Units
> ConstantMultiplierOptimization.testConstMultiplierI  thrpt    2  283.827          ops/min
> ConstantMultiplierOptimization...
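
For the one benchmark whose baseline and optimized scores both appear in the quoted text (`testConstMultiplierI`), the relative gain can be worked out as below. This is a rough figure from two-iteration scores with no reported error, and `percentGain` is a hypothetical helper:

```java
public class SpeedupFromScores {
    // Relative throughput gain of the optimized run over the baseline, in percent.
    static double percentGain(double baseline, double optimized) {
        return (optimized - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores (ops/min) quoted above for testConstMultiplierI.
        double gain = percentGain(189.690, 283.827);
        System.out.printf("testConstMultiplierI: %.1f%% higher throughput%n", gain);
    }
}
```

The result, roughly 49.6%, sits at the upper end of the 5-50% range stated in the description.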

Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:

  Using template-framework for JTREG test generation

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/28759/files
  - new: https://git.openjdk.org/jdk/pull/28759/files/7489c7fe..d792c49b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=01-02

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/28759.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28759/head:pull/28759

PR: https://git.openjdk.org/jdk/pull/28759