RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions [v3]
Jatin Bhateja
jbhateja at openjdk.org
Tue Dec 16 11:47:33 UTC 2025
> Emulate multiplier using LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
> Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details reagarding slow vs fast lea instructions.
> Given that latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs each with 1 cycle latency to emulate multipler is performant.
>
> Consider X as the multiplicand, by variying the scale of first LEA instruction we can generate 4 input i.e.
>
>
> X + X * 1 = 2X
> X + X * 2 = 3X
> X + X * 4 = 5X
> X + X * 8 = 9X
>
>
> Following table list downs various multiplier combinations for output of first LEA at BASE and/or INDEX by varying the
> scale of second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.
>
>
> BASE INDEX SCALE MULTIPLER
> X X 1 2 (Terminal)
> X X 2 3 (Terminal)
> X X 4 5 (Terminal)
> X X 8 9 (Terminal)
> 3X 3X 1 6
> X 3X 2 7
> 5X 5X 1 10
> X 5X 2 11
> X 3X 4 13
> 5X 5X 2 15
> X 2X 8 17
> 9X 9X 1 18
> X 9X 2 19
> X 5X 4 21
> 5X 5X 4 25
> 9X 9X 2 27
> X 9X 4 37
> X 5X 8 41
> 9X 9X 4 45
> X 9X 8 73
> 9X 9X 8 81
>
>
> All the non-unity inputs tied to BASE / INDEX are derived out of terminal cases which represent first FAST LEA. Thus, all the multipliers can be computed using just two LEA instructions.
>
> Performance numbers for new micro benchmark included with this patch shows around **5-50% improvments** on latest x86 servers.
>
>
> System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
> Baseline:-
> Benchmark Mode Cnt Score Error Units
> ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 189.690 ops/min
> ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 196.283 ops/min
>
>
> Withopt:-
> Benchmark Mode Cnt Score Error Units
> ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 283.827 ops/min
> ConstantMultiplierOptimization...
Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
Using template-framework for JTREG test generation
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/28759/files
- new: https://git.openjdk.org/jdk/pull/28759/files/7489c7fe..d792c49b
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=02
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=01-02
Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/28759.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28759/head:pull/28759
PR: https://git.openjdk.org/jdk/pull/28759
More information about the hotspot-dev
mailing list