RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() [v3]

Emanuel Peter epeter at openjdk.org
Wed Apr 2 07:15:27 UTC 2025


On Thu, 9 Jan 2025 06:21:14 GMT, erifan <duke at openjdk.org> wrote:

>> @erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here.
>> 
>> For `MulINode::Ideal` with patterns `var * con`, we really have these options in assembly:
>> - `mul`: the general case.
>> - `shift` and `add`: when profitable.
>> - `lea`: could this be an improvement over `shift` and `add`? (See the sketch right after this list.)
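>> 
>> For concreteness, a minimal sketch of the equivalences in play (illustrative Java, not code from this patch):
>> 
>> ```java
>> class MulByConst {
>>     // Multiplying by a constant near a power of two can be rewritten with a
>>     // shift plus an add/sub; both forms wrap identically under int overflow.
>>     static int timesEight(int x) { return x << 3; }        // x * 8
>>     static int timesNine(int x)  { return (x << 3) + x; }  // x * 9
>>     static int timesSeven(int x) { return (x << 3) - x; }  // x * 7
>> }
>> ```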
>> 
>> The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember `mul` is not always available on all `ALU`s, but `add` and `shift` should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.
>> 
>> Additionally: if your workload has other `mul`, `shift` and `add` mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.
>> 
>> And then the characteristics of scalar ops may not be identical to vector ops.
>> 
>> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
>> And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.
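>> 
>> Such a benchmark could take roughly the following JMH shape (a hedged sketch: class and method names are made up, fork/warmup configuration omitted):
>> 
>> ```java
>> import org.openjdk.jmh.annotations.*;
>> 
>> @State(Scope.Thread)
>> public class MulConstBench {
>>     int x = 42;  // state field, so C2 cannot constant-fold the multiply
>> 
>>     @Benchmark
>>     public int mulNine() { return x * 9; }               // whatever C2 emits
>> 
>>     @Benchmark
>>     public int shiftAddNine() { return (x << 3) + x; }   // forced shift+add
>> }
>> ```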
>> 
>> Maybe perfect tuning is not possible. Maybe we are willing to take a `5%` regression in some cases to boost other cases by `30%`. But that is a **big maybe**: we really do not like introducing regressions in existing code; people tend to be more upset by a regression than they are pleased by a speedup - so work like this can be delicate.
>> 
>> Anyway, right now I don't have much time to investigate and work on this myself, so you'd have to do the work: benchmarks, explanations, etc. **But I think the `30%` speedup indicates that this work could really have potential!**
>> 
>> As to what to do in sequence, here is a suggestion:
>> 1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
>> 2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into `MulV`. And then go into the mul -> sh...
>
> Hi @eme64, thanks for your review.
> 
>> 1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
>> 2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1.
>> 3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there.
> 
> I agree with you. I am actually working on `1`. The slightly troublesome thing is that `1` and `3` both depend on the target architecture, so it might take a little more time.
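> 
> To make `1` concrete, the kind of Vector API code it targets looks roughly like this (a sketch; the class and method names are made up, and it needs `--add-modules jdk.incubator.vector`):
> 
> ```java
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorSpecies;
> 
> class VectorMulByConst {
>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
> 
>     // A vector multiply by 7 = 8 - 1 is a candidate for a shift+sub rewrite per lane.
>     static void mulBySeven(int[] a, int[] r) {
>         int i = 0;
>         for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
>             IntVector v = IntVector.fromArray(SPECIES, a, i);
>             v.mul(7).intoArray(r, i);           // could become (v << 3) - v
>         }
>         for (; i < a.length; i++) r[i] = a[i] * 7;  // scalar tail
>     }
> }
> ```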
> 
>> lea could this be an improvement over shift and add?
> 
> AArch64 doesn't actually have a `lea` instruction. On x64 there are already rules that turn `shift`+`add` into `lea`.
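> 
> For illustration (a sketch, not actual matcher output): on x64 the address-generation form folds the shift and the add into one instruction, while AArch64 reaches a similar effect with a shifted-register `add`.
> 
> ```java
> class MulByFive {
>     // x * 5 == (x << 2) + x. On x64 this can be a single
>     //   lea eax, [rdi + rdi*4]
>     // while AArch64 can use an add with a shifted register:
>     //   add w0, w0, w0, lsl #2
>     static int timesFive(int x) { return (x << 2) + x; }
> }
> ```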
> 
>> The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.
>> 
>> Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.
>> 
>> And then the characteristics of scalar ops may not be identical to vector ops.
> 
> 
> Yes, this is very tricky: the actual performance depends on many factors, such as the pipeline, latency, throughput, the ROB, and even memory performance. We can only optimize based on certain references and generalities, such as the latency and throughput of different instructions. But when it comes to generalities, it is actually difficult to say which scenario is more common.
> 
>> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
>> And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.
> 
> I don't know of such a benchmark suite yet. For AArch64, I usually refer to [the Arm Optimization Guide](https:...

@erifan you reopened this. Does that mean we should review it again? I see that you have not made any changes since our last conversation. If it is not ready for review, could you please convert it to Draft, so it is clear that you are not currently asking for reviews?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2771545449

