RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()
erifan
duke at openjdk.org
Mon Feb 10 21:25:01 UTC 2025
On Thu, 9 Jan 2025 04:59:07 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>>> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.
>>
>> Ok, I'll combine the patches and file one big patch, and update the test results.
>>
>>> Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.
>>
>> OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors.
>>
>>> What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right?
>>
>> Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting.
>>
>>> it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable.
>>
>> Yes.
>
> @erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here.
>
> For `MulINode::Ideal` with patterns `var * con`, we really have these options in assembly:
> - `mul` general case.
> - `shift` and `add` when profitable.
> - `lea` could this be an improvement over `shift` and `add`?
>
> The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember `mul` is not always available on all `ALU`s, but `add` and `shift` should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.
>
> Additionally: if your workload has other `mul`, `shift` and `add` mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.
>
> And then the characteristics of scalar ops may not be identical to vector ops.
>
> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
> And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.
>
> Maybe perfect tuning is not possible. Maybe we are willing to take a `5%` regression in some cases to boost other cases by `30%`. But that is a **big maybe**: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.
>
> Anyway, I don't right now have much time to investigate and work on this myself. So you'd have to do the work, benchmark, explanation etc. **But I think the `30%` speedup indicates that this work could really have potential!**
>
> As to what to do in sequence, here a suggestion:
> 1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
> 2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into `MulV`. And then go into the mul -> shift optimization for vectors under point 1.
> 3....
Hi @eme64, thanks for your review.
> 1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
> 2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into `MulV`. And then go into the mul -> shift optimization for vectors under point 1.
> 3. Tackle `MulINode::Ideal` for scalar cases after loop-opts, and see what you can do there.
I agree with you, and I am actually working on `1`. The slightly troublesome part is that both `1` and `3` are architecture-dependent, so it might take a little more time.
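To make `1` a bit more concrete, here is a rough sketch of the kind of shape I have in mind (hypothetical example code, not part of any patch): a Vector API multiplication by a power-of-two constant that could instead be emitted as a lane-wise shift.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorMulByConstant {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Today this produces a MulV with a broadcast constant (tail handling omitted).
    static void mulBy8(int[] a, int[] r) {
        for (int i = 0; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            v.mul(8).intoArray(r, i);
        }
    }

    // Semantically equivalent lane-wise shift; the idea behind `1` is that C2 could
    // rewrite the MulV above into this LShiftV shape when that is profitable.
    static void shiftBy3(int[] a, int[] r) {
        for (int i = 0; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            v.lanewise(VectorOperators.LSHL, 3).intoArray(r, i);
        }
    }
}
```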
> `lea` could this be an improvement over `shift` and `add`?
AArch64 doesn't actually have a `lea` instruction. On x64 there are already match rules that turn `shift` + `add` combinations into `lea`.
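Just to spell out the scalar identities involved (standard algebra, nothing specific to this patch):

```java
// Multiplication by a constant close to a power of two can be rewritten with shifts:
static int mul8(int x) { return x << 3; }        // x * 8  ==  x << 3        (2^n)
static int mul9(int x) { return (x << 3) + x; }  // x * 9  == (x << 3) + x   (2^n + 1)
static int mul7(int x) { return (x << 3) - x; }  // x * 7  == (x << 3) - x   (2^n - 1)
```

On x64 the `(x << k) + x` shape can then match an addressing-mode form through those existing rules, so for example `x * 5` may come out as a single `lea` (something like `lea rax, [rdi + rdi*4]`).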
> The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember `mul` is not always available on all `ALU`s, but `add` and `shift` should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.
> Additionally: if your workload has other `mul`, `shift` and `add` mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.
> And then the characteristics of scalar ops may not be identical to vector ops.
Yes, this is very tricky; the actual performance depends on many factors, such as the pipeline, latency, throughput, ROB size, and even memory performance. We can only optimize based on certain references and general assumptions, such as the latency and throughput of different instructions. But when it comes to generalities, it is hard to say which scenario is the more common one.
> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
> And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.
I don't know of such a benchmark suite yet. For AArch64, I usually refer to [the Arm Optimization Guide](https://developer.arm.com/documentation/109898/latest/), but some instructions seem to be missing there. I guess AMD and Intel have similar documents, but I'm not sure.
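In the absence of such a suite, I would probably start with a small JMH microbenchmark along these lines (a minimal sketch; the constants and method names are made up) and run it on each of the CPUs mentioned above:

```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class MulVsShiftAdd {
    static final int SIZE = 1024;
    int[] a = new int[SIZE];
    int[] r = new int[SIZE];

    @Setup
    public void setup() {
        for (int i = 0; i < SIZE; i++) a[i] = i;
    }

    // Multiplication by a constant; C2 may or may not decompose this today.
    @Benchmark
    public int[] mulByConstant() {
        for (int i = 0; i < SIZE; i++) r[i] = a[i] * 9;
        return r;
    }

    // Hand-written shift + add shape for the same computation.
    @Benchmark
    public int[] shiftAdd() {
        for (int i = 0; i < SIZE; i++) r[i] = (a[i] << 3) + a[i];
        return r;
    }
}
```

Both loops are auto-vectorization candidates, which is part of the point: the same comparison would cover both the scalar and the vector case on a given CPU.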
> Maybe perfect tuning is not possible. Maybe we are willing to take a 5% regression in some cases to boost other cases by 30%. But that is a big maybe: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.
Yes, I agree; I will deal with any performance regressions.
> And please investigate when lea is applicable / profitable.
Do you mean `shift` + `add` => `lea`? I think this is already done on x64.
> We may also want to get people from ARM and Intel into this discussion at some point.
Yes.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2579263004