RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

Emanuel Peter epeter at openjdk.org
Mon Feb 10 21:25:01 UTC 2025


On Wed, 8 Jan 2025 03:00:06 GMT, erifan <duke at openjdk.org> wrote:

>> Hi @erifan
>> 
>> I have some first questions / comments. I only scanned through quickly.
>> 
>> My biggest question:
>> You only mention aarch64. But we would need to know that your changes also work well on x64.
>> 
>> Also:
>> Can you summarize if your changes are only for performance of vectorization, or also for scalar code?
>
> Hi @eme64 , thanks for your review!
> 
>> My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64.
> 
> Yes, this patch also benefits the x86_64 platform. I tested the patch on aarch64 V2 and N1 processors, and on AMD64 Genoa and Intel SPR processors (I can provide the test results if necessary). The test results show that for some cases the performance uplift is very large, and there is no obvious performance degradation. I'm not familiar with the x86_64 instruction set, so I didn't do any theoretical analysis for x86_64, but I did a very detailed theoretical analysis on aarch64.
> 
> I don't have machines with architectures other than aarch64 and x64, so this patch has not been tested on any other platforms.
> 
>> Also: Can you summarize if your changes are only for performance of vectorization, or also for scalar code?
> 
> For both vectorization and scalar cases. For example, the pattern `x * C => -((x<<m) + (x<<n))` is currently supported; this pattern is a negative optimization from both the vectorization and scalar perspectives on both aarch64 and x64, so this patch removes it.
> 
> As mentioned in my commit message, this patch is good for most cases, but there is a small performance loss in some cases. Ideally, if we implement vplan, I think it would be better to keep the multiplication operation before vectorization and let the vectorizer generate the optimal vector code. Then, for scalar cases that cannot be vectorized, convert the multiplication into other, more optimal instructions in the backend (or in [this pass](https://github.com/openjdk/jdk/pull/21599) if it's merged) if necessary.
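As an aside, the pattern quoted above is easy to sanity-check with a small scalar sketch (the helper name and the shift amounts `m = 3`, `n = 1` are illustrative, not taken from the patch). Note the identity holds even under Java's wrap-around overflow, since both sides are computed modulo 2^32:

```java
public class MulToShiftDemo {
    // Rewrites x * C as -((x << m) + (x << n)) for constants of the
    // form C = -((1 << m) + (1 << n)); this is the pattern the patch
    // removes from MulINode::Ideal.
    static int mulViaShifts(int x, int m, int n) {
        return -((x << m) + (x << n));
    }

    public static void main(String[] args) {
        int m = 3, n = 1;
        int c = -((1 << m) + (1 << n));   // c == -10
        for (int x : new int[] {0, 1, -1, 7, 123_456, Integer.MIN_VALUE}) {
            if (x * c != mulViaShifts(x, m, n)) {
                throw new AssertionError("mismatch at x = " + x);
            }
        }
        System.out.println("all checks passed");
    }
}
```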

Hi @erifan 

Thanks for your responses. I now looked at the benchmark results.
I see regressions in the range of 5% on both of your tested platforms. I'm hesitant to accept that now without the follow-up patches standing first. Maybe you can change the order of your RFE's so that we have no regressions in between?

Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.

Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.

> If we implement vplan

What do you mean by `VPlan`? Are you talking about LLVM? I am working on something a little similar with `VTransform`, but I'm not sure it is relevant. In theory you can also use the Vector API, and then it would be nice if a vector mul were also changed into shifts where that is more profitable. Actually, maybe you should address that first: converting a vector mul into a vector shift. That would be a nice stand-alone change that you could implement without regressions, right?

What do you think?
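For what it's worth, the profitability question above largely hinges on how many shift/add operations a given constant decomposes into. A rough scalar sketch of that classification (the class and method names are illustrative; this is not the actual C2 logic in mulnode.cpp):

```java
public class MulConstClassifier {
    // Returns the number of cheap ops a multiply by c could become:
    //   1 for |c| = 2^k                  (x << k)
    //   2 for |c| = 2^m + 2^n            ((x << m) + (x << n))
    //   2 for |c| = 2^m - 1              ((x << m) - x)
    //  -1 when no simple decomposition applies (keep the multiply).
    static int shiftOpCount(int c) {
        long abs = Math.abs((long) c);        // widen: abs(MIN_VALUE) would overflow int
        if (Long.bitCount(abs) == 1)     return 1;
        if (Long.bitCount(abs) == 2)     return 2;
        if (Long.bitCount(abs + 1) == 1) return 2;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(" 8 -> " + shiftOpCount(8));    // 1: x << 3
        System.out.println("10 -> " + shiftOpCount(10));   // 2: (x << 3) + (x << 1)
        System.out.println(" 7 -> " + shiftOpCount(7));    // 2: (x << 3) - x
        System.out.println("11 -> " + shiftOpCount(11));   // -1: keep the multiply
    }
}
```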

>> src/hotspot/share/opto/mulnode.cpp line 253:
>> 
>>> 251:   }
>>> 252: 
>>> 253:   // TODO: abs_con = (1<<m)+(1<<n) and con = -((1<<n)+1) was supported
>> 
>> Is this `TODO` here on purpose?
>
> Yes, it's a reminder to implement these two patterns for scalar cases in backends if they are worthwhile on the specific architecture.

Aha, I see. We don't leave TODOs in the code, because nobody will ever look at them again. If we agree to go ahead with this, then you should rather file an RFE to keep track of it. But before you file it, let's first discuss the overall strategy.

>> src/hotspot/share/opto/mulnode.cpp line 263:
>> 
>>> 261:   //
>>> 262:   // But if it's not vectorizable, maybe it's profitable to do the conversion on
>>> 263:   // some architectures, support it in backends if it's worthwhile.
>> 
>> If it is all about vectorization only: we could consider delaying the `Ideal` optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize.
>
> But for some cases, it's better to do the conversion before vectorization, for example:
> `x * 8 => x << 3 `
> 
> Through test results and theoretical analysis (only on aarch64, see the commit message), I found that it's better to do the conversion before vectorization if the multiplication can be transformed into at most one or two shift/add instructions.

In theory we could also do a transform with vectors, and convert a vector mul to a vector shift, right?

It is a bit scary to have these cases where some are better before and some better after vectorization. That makes performance quite unpredictable: your change may introduce improvements in some cases but regressions in others.
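To make the before/after-vectorization question concrete, here is a minimal pair of loops (illustrative only). Both are candidates for C2's auto-vectorizer, and they are semantically identical, so in principle either the Ideal transform or the vectorizer could pick whichever vector form is cheaper on the target:

```java
public class MulVsShiftLoops {
    static void mulBy8(int[] a, int[] r) {
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] * 8;      // may become a vector multiply
        }
    }

    static void shiftBy3(int[] a, int[] r) {
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] << 3;     // equivalent vector shift form
        }
    }

    public static void main(String[] args) {
        int[] a = {0, 1, -1, 1 << 28, Integer.MAX_VALUE};
        int[] r1 = new int[a.length], r2 = new int[a.length];
        mulBy8(a, r1);
        shiftBy3(a, r2);
        if (!java.util.Arrays.equals(r1, r2)) {
            throw new AssertionError("mul and shift forms disagree");
        }
        System.out.println("mul-by-8 and shift-by-3 agree on all inputs");
    }
}
```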

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2576897834
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906563179
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906567037


More information about the hotspot-compiler-dev mailing list