RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

erifan duke at openjdk.org
Mon Feb 10 21:25:01 UTC 2025


On Tue, 7 Jan 2025 17:30:04 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB instructions, since these instructions generally have lower latency and higher throughput on most architectures. For example:
>> 1. x*8 can be optimized as x<<3.
>> 2. x*9 can be optimized as x + (x<<3), and x + (x<<3) can be lowered to one SHIFT-ADD instruction (an ADD combined with a LEFT SHIFT) on some architectures, like aarch64 and x86_64.
>> 
>> Currently OpenJDK implements a few such patterns in the mid-end, including:
>> 1. |C| = 1<<n                (n>0)
>> 2. |C| = (1<<n) - 1          (n>0)
>> 3. |C| = (1<<m) + (1<<n)     (m>n, n>=0)
>> 
>> The first two are fine, because on most architectures they are lowered to only one ADD/SUB/SHIFT instruction.
>> 
>> But the third pattern doesn't always perform well on some architectures, such as aarch64. It can be split into the following sub-patterns:
>> 3.1. C = (1<<n) + 1          (n>0)
>> 3.2. C = -((1<<n) + 1)       (n>0)
>> 3.3. C = (1<<m) + (1<<n)     (m>n, n>0)
>> 3.4. C = -((1<<m) + (1<<n))  (m>n, n>0)
>> 
>> According to the Arm optimization guide, if the shift amount is > 4, the latency and throughput of a shifted ADD are the same as those of MUL. So in that case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.
>> 
>> Before (MUL is not converted):
>> 
>>   mov x1, #C
>>   mul x2, x1, x0
>> 
>> 
>> Now (MUL is converted):
>>   For 3.1:
>> 
>>     add x2, x0, x0, lsl #n
>> 
>> 
>>   For 3.2:
>> 
>>     add x2, x0, x0, lsl #n  // same cost as mul if n > 4
>>     neg x2, x2
>> 
>> 
>>   For 3.3:
>> 
>>     lsl x1, x0, #m
>>     add x2, x1, x0, lsl #n  // same cost as mul if n > 4
>> 
>> 
>>   For 3.4:
>> 
>>     lsl x1, x0, #m
>>     add x2, x1, x0, lsl #n  // same cost as mul if n > 4
>>     neg x2, x2
>> 
>> 
>> Test results (ns/op) on Arm Neoverse V2:
>> 
>>              Before   Now      Uplift       Pattern  Notes
>> testInt9     103.379  60.702   1.70305756   3.1
>> testIntN33   103.231  106.825  0.96635619   3.2      n > 4
>> testIntN9    103.448  103.005  1.004300762  3.2      n <= 4
>> testInt18    103.354  99.271   1.041129837  3.3      m <= 4, n <= 4
>> testInt36    103.396  99.186   1.042445506  3.3      m > 4, n <= 4
>> testInt96    103.337  105.416  0.980278136  3.3      m > 4, n > 4
>> testIntN18   103.333  139.258  0.742025593  3.4      m <= 4, n <= 4
>> testIntN36   103.208  139.132  0.741799155  3.4      m > 4, n <= 4
>> testIntN96   103.367  139.471  0.74113615   3.4      m > 4, n > 4
>> 
>> 
>> **(S1) Based on the above, we should treat pattern 3 as follows:**
>> 3.1  C = (1<<n) + 1          (n>0)
>> 3.2  C = -((1<<n) + 1)       (0<n<=4)
>> 3.3  C...
>
> Hi @erifan
> 
> I have some first questions / comments. I only scanned through quickly.
> 
> My biggest question:
> You only mention aarch64. But we would need to know that your changes also work well on x64.
> 
> Also:
> Can you summarize if your changes are only for performance of vectorization, or also for scalar code?

Hi @eme64 , thanks for your review!

> My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64.

Yes, this patch also benefits the x86_64 platform. I tested the patch on aarch64 Neoverse V2 and N1 processors, and on AMD Genoa and Intel SPR processors (I can provide the test results if necessary). The results show that in some cases the performance uplift is very large, with no obvious performance degradation. I'm not familiar with the x86_64 instruction set, so I did no theoretical analysis for x86_64; I did a very detailed theoretical analysis on aarch64.

I don't have machines with architectures other than aarch64 and x64, so this patch has not been tested on any platforms other than aarch64 and x64.

> Also: Can you summarize if your changes are only for performance of vectorization, or also for scalar code?

For both vectorization and scalar cases. For example, the pattern x * C => -((x<<m) + (x<<n)) is currently supported; it is a pessimization from both the vectorization and the scalar perspective on both aarch64 and x64, and this patch removes it.
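To make concrete what that removed rewrite does, here is a small self-contained Java sketch (class and method names are mine, not from the patch) showing the identity behind x * C => -((x<<m) + (x<<n)), instantiated for C = -18, i.e. m = 4, n = 1:

```java
public class NegMulDemo {
    // Identity behind the removed pattern, for C = -18 = -((1<<4) + (1<<1)):
    // x * -18 == -((x << 4) + (x << 1)) under Java's wrap-around int arithmetic.
    static int viaShifts(int x) {
        return -((x << 4) + (x << 1));
    }

    public static void main(String[] args) {
        for (int x : new int[]{0, 1, -7, 12345, Integer.MAX_VALUE}) {
            if (x * -18 != viaShifts(x)) {
                throw new AssertionError("mismatch at x = " + x);
            }
        }
        System.out.println("identity holds for all sampled values");
    }
}
```

The rewrite is always correct, but it costs a dependent shift, add, and negate, which is consistent with the slowdown the 3.4 rows of the benchmark table show.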

As mentioned in my commit message, this patch is good for most cases, but there is a small performance loss in some cases. Ideally, once VPlan is implemented, I think it would be better to keep the multiplication before vectorization and let the vectorizer generate the optimal vector code. Then, for scalar code that cannot be vectorized, convert the multiplication to more optimal instructions in the backend (or in [this pass](https://github.com/openjdk/jdk/pull/21599) if it's merged) where necessary.

> src/hotspot/share/opto/mulnode.cpp line 253:
> 
>> 251:   }
>> 252: 
>> 253:   // TODO: abs_con = (1<<m)+(1<<n) and con = -((1<<n)+1) was supported
> 
> Is this `TODO` here on purpose?

Yes, it's a reminder to implement these two patterns for scalar code in the backends where they are worthwhile on the specific architecture.
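For illustration, the two constant shapes the TODO refers to can be recognized with simple bit tests. This is a hypothetical Java sketch of the matching conditions, not HotSpot's actual code:

```java
public class MulPatterns {
    // |c| = (1<<m) + (1<<n) with m > n >= 0, i.e. |c| has exactly two set bits.
    static boolean isTwoBitSum(int c) {
        return c != Integer.MIN_VALUE && Integer.bitCount(Math.abs(c)) == 2;
    }

    // c = -((1<<n) + 1) with n > 0, i.e. -c - 1 is a power of two greater than 1.
    static boolean isNegPow2Plus1(int c) {
        if (c >= -2) return false;        // rules out non-negative c and n == 0
        int p = -c - 1;
        return Integer.bitCount(p) == 1;  // p is a power of two; p >= 2 by the guard
    }

    public static void main(String[] args) {
        System.out.println(isTwoBitSum(72));      // 72 = (1<<6) + (1<<3)
        System.out.println(isNegPow2Plus1(-33));  // -33 = -((1<<5) + 1)
    }
}
```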

> src/hotspot/share/opto/mulnode.cpp line 263:
> 
>> 261:   //
>> 262:   // But if it's not vectorizable, maybe it's profitable to do the conversion on
>> 263:   // some architectures, support it in backends if it's worthwhile.
> 
> If it is all about vectorization only: we could consider delaying the `Ideal` optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize.

But in some cases it's better to do the conversion before vectorization, for example:
`x * 8 => x << 3`

Through the test results and theoretical analysis (aarch64 only, see the commit message), I found that it's better to do the conversion before vectorization when the multiplication can be transformed into no more than two shift/add instructions.
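The cheap cases this argument relies on are plain int identities. A minimal Java sketch (method names are mine) of the two one-instruction rewrites:

```java
public class ShiftMulDemo {
    // Pattern 1: |C| = 1<<n, e.g. x * 8 == x << 3.
    static int mul8(int x) {
        return x << 3;
    }

    // Pattern 3.1: C = (1<<n) + 1, e.g. x * 9 == (x << 3) + x,
    // a single shift-add on aarch64 and x86_64.
    static int mul9(int x) {
        return (x << 3) + x;
    }

    public static void main(String[] args) {
        for (int x : new int[]{0, 1, -1, 1 << 28, Integer.MIN_VALUE}) {
            if (x * 8 != mul8(x) || x * 9 != mul9(x)) {
                throw new AssertionError("mismatch at x = " + x);
            }
        }
        System.out.println("identities hold");
    }
}
```

Both identities hold for every int because shifts and adds agree with multiplication modulo 2^32.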

> src/hotspot/share/opto/mulnode.cpp line 273:
> 
>> 271:         new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1))));
>> 272:     res = new AddINode(in(1), n1);
>> 273:   } else if (is_power_of_2(abs_con + 1)) {
> 
> So now you only check for `power_of_2   +- 1`, right? But before we also looked at patterns with 2 bits, such as `64 + 8`.
> 
> You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64.

Yes, I have tested the patch on several x64 machines, including AMD Genoa, Intel SPR, and some older x86 machines; there's no noticeable performance loss. Test results on aarch64 and AMD Genoa are included in the commit message, please take a look, thanks~

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2576610164
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906216214
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906220122
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906222636
