RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions
Jatin Bhateja
jbhateja at openjdk.org
Mon Dec 15 17:27:12 UTC 2025
Emulate multiplier using LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details reagarding slow vs fast lea instructions.
Given that latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs each with 1 cycle latency to emulate multipler is performant.
Consider X as the multiplicand, by variying the scale of first LEA instruction we can generate 4 input i.e.
X + X * 1 = 2X
X + X * 2 = 3X
X + X * 4 = 5X
X + X * 8 = 9X
Following table list downs various multiplier combinations for output of first LEA at BASE and/or INDEX by varying the
scale of second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.
BASE INDEX SCALE MULTIPLER
X X 1 2 (Terminal)
X X 2 3 (Terminal)
X X 4 5 (Terminal)
X X 8 9 (Terminal)
3X 3X 1 6
X 3X 2 7
5X 5X 1 10
X 5X 2 11
X 3X 4 13
5X 5X 2 15
X 2X 8 17
9X 9X 1 18
X 9X 2 19
X 5X 4 21
5X 5X 4 25
9X 9X 2 27
X 9X 4 37
X 5X 8 41
9X 9X 4 45
X 9X 8 73
9X 9X 8 81
All the non-unity inputs tied to BASE / INDEX are derived out of terminal cases which represent first FAST LEA. Thus, all the multipliers can be computed using just two LEA instructions.
Performance numbers for new micro benchmark included with this patch shows around **5-50% improvments** on latest x86 servers.
System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
Baseline:-
Benchmark Mode Cnt Score Error Units
ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 189.690 ops/min
ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 196.283 ops/min
Withopt:-
Benchmark Mode Cnt Score Error Units
ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 283.827 ops/min
ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 283.578 ops/min
System: AMD EPYC 9B45 Turin:-
Baseline:-
Benchmark Mode Cnt Score Error Units
ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 393.299 ops/min
ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 393.764 ops/min
Withopt:-
Benchmark Mode Cnt Score Error Units
ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 409.790 ops/min
ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 409.756 ops/min
Effect of optimization is more pronounced on Intel server in comparison to AMD's, As per Agner Fogs' instruction latency manual IMUL instruciton has resiprocal througput of 1 while Fast LEA has reciprocal througput of 0.25 on Zen4 and around 0.5 on Intel Sunnycove (Icelake), going by that we can executue more LEA in parallel in one cycle, it will need to be investigated seperately why we don't see similar gains on AMD target.
Kindly review and share your feedback.
Best Regards,
Jatin
-------------
Commit messages:
- Adding IR framework tests
- Adding benchmark
- 8373480: Optimize constant input multiplication using LEA instructions
Changes: https://git.openjdk.org/jdk/pull/28759/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28759&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8373480
Stats: 535 lines in 8 files changed: 525 ins; 0 del; 10 mod
Patch: https://git.openjdk.org/jdk/pull/28759.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28759/head:pull/28759
PR: https://git.openjdk.org/jdk/pull/28759
More information about the hotspot-dev
mailing list