RFR: 8373480: Optimize multiplication by constant multiplier using LEA instructions [v4]
Galder Zamarreño
galder at openjdk.org
Mon Dec 22 06:01:52 UTC 2025
On Tue, 16 Dec 2025 11:56:40 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Emulate multiplier using LEA addressing scheme, where effective address = BASE + INDEX * SCALE + OFFSET
>> Refer to section "3.5.1.2 Using LEA" of Intel's optimization manual for details reagarding slow vs fast lea instructions.
>> Given that latency of IMUL with register operands is 3 cycles, a combination of two fast LEAs each with 1 cycle latency to emulate multipler is performant.
>>
>> Consider X as the multiplicand, by variying the scale of first LEA instruction we can generate 4 input i.e.
>>
>>
>> X + X * 1 = 2X
>> X + X * 2 = 3X
>> X + X * 4 = 5X
>> X + X * 8 = 9X
>>
>>
>> Following table list downs various multiplier combinations for output of first LEA at BASE and/or INDEX by varying the
>> scale of second fast LEA instruction. We will only handle the cases which cannot be handled by just shift + add.
>>
>>
>> BASE INDEX SCALE MULTIPLER
>> X X 1 2 (Terminal)
>> X X 2 3 (Terminal)
>> X X 4 5 (Terminal)
>> X X 8 9 (Terminal)
>> 3X 3X 1 6
>> X 3X 2 7
>> 5X 5X 1 10
>> X 5X 2 11
>> X 3X 4 13
>> 5X 5X 2 15
>> X 2X 8 17
>> 9X 9X 1 18
>> X 9X 2 19
>> X 5X 4 21
>> 5X 5X 4 25
>> 9X 9X 2 27
>> X 9X 4 37
>> X 5X 8 41
>> 9X 9X 4 45
>> X 9X 8 73
>> 9X 9X 8 81
>>
>>
>> All the non-unity inputs tied to BASE / INDEX are derived out of terminal cases which represent first FAST LEA. Thus, all the multipliers can be computed using just two LEA instructions.
>>
>> Performance numbers for new micro benchmark included with this patch shows around **5-50% improvments** on latest x86 servers.
>>
>>
>> System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz Emerald Rapids:-
>> Baseline:-
>> Benchmark Mode Cnt Score Error Units
>> ConstantMultiplierOptimization.testConstMultiplierI thrpt 2 189.690 ops/min
>> ConstantMultiplierOptimization.testConstMultiplierL thrpt 2 196.283 ops/min
>>
>>
>> Withopt:-
>> Benchmark Mode Cnt Score Error Units
>> Constant...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Minor cleanup in Template-Framework test
Nice improvement @jatin-bhateja, just some small comments
Matt did an [Advent of Compiler Optimizations video](https://youtu.be/1X88od0miHs?si=wlYCsbZ1vmJA_rVf) precisely on this topic recently :)
test/hotspot/jtreg/compiler/c2/TestConstantMultiplier.java line 62:
> 60:
> 61: // Add a java source file.
> 62: comp.addJavaSourceCode("c2.compilerr.ConstantMultiplierTest", generate(comp));
Suggestion:
comp.addJavaSourceCode("c2.compiler.ConstantMultiplierTest", generate(comp));
Package name spelling mistake.
test/micro/org/openjdk/bench/vm/compiler/ConstantMultiplierOptimization.java line 46:
> 44: public class ConstantMultiplierOptimization {
> 45:
> 46: public static int mul_by_25_I(int a) {
Should this and other methods be annotated with `@ForceInline` just in case?
-------------
Changes requested by galder (Author).
PR Review: https://git.openjdk.org/jdk/pull/28759#pullrequestreview-3602769564
PR Review Comment: https://git.openjdk.org/jdk/pull/28759#discussion_r2638758416
PR Review Comment: https://git.openjdk.org/jdk/pull/28759#discussion_r2638761774
More information about the hotspot-dev
mailing list