RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions [v6]
Jatin Bhateja
jbhateja at openjdk.org
Mon Dec 22 12:09:01 UTC 2025
> Emulate multiplier using LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
> Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details reagarding slow vs fast lea instructions.
> Given that latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs each with 1 cycle latency to emulate multipler is performant.
>
> Consider X as the multiplicand, by variying the scale of first LEA instruction we can generate 4 input i.e.
>
>
> X + X * 1 = 2X
> X + X * 2 = 3X
> X + X * 4 = 5X
> X + X * 8 = 9X
>
>
> Following table list downs various multiplier combinations for output of first LEA at BASE and/or INDEX by varying the
> scale of second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.
>
>
> BASE INDEX SCALE MULTIPLER
> X X 1 2 (Terminal)
> X X 2 3 (Terminal)
> X X 4 5 (Terminal)
> X X 8 9 (Terminal)
> 3X 3X 1 6
> X 3X 2 7
> 5X 5X 1 10
> X 5X 2 11
> X 3X 4 13
> 5X 5X 2 15
> X 2X 8 17
> 9X 9X 1 18
> X 9X 2 19
> X 5X 4 21
> 5X 5X 4 25
> 9X 9X 2 27
> X 9X 4 37
> X 5X 8 41
> 9X 9X 4 45
> X 9X 8 73
> 9X 9X 8 81
>
>
> All the non-unity inputs tied to BASE / INDEX are derived out of terminal cases which represent first FAST LEA. Thus, all the multipliers can be computed using just two LEA instructions.
>
> Performance numbers for new micro benchmark included with this patch shows around **5-50% improvments** on latest x86 servers.
>
>
> System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
> Baseline:-
> Benchmark Mode Cnt Score Error Units
> ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 189.690 ops/min
> ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 196.283 ops/min
>
>
> Withopt:-
> Benchmark Mode Cnt Score Error Units
> ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 283.827 ops/min
> ConstantMultiplierOptimization...
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
Extending micro and jtreg tests for memory patterns
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/28759/files
- new: https://git.openjdk.org/jdk/pull/28759/files/8e42a466..b7756730
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=05
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=04-05
Stats: 154 lines in 5 files changed: 142 ins; 0 del; 12 mod
Patch: https://git.openjdk.org/jdk/pull/28759.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28759/head:pull/28759
PR: https://git.openjdk.org/jdk/pull/28759
More information about the hotspot-dev
mailing list