RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

Emanuel Peter epeter at openjdk.org
Mon Feb 10 21:25:01 UTC 2025


On Wed, 8 Jan 2025 08:03:54 GMT, erifan <duke at openjdk.org> wrote:

>> Hi @erifan 
>> 
>> Thanks for your responses. I now looked at the benchmark results.
>> I see regressions in the range of 5% on both of your tested platforms. I'm hesitant to accept that now without the follow-up patches landing first. Maybe you can change the order of your RFEs so that we have no regressions in between?
>> 
>> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.
>> 
>> Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.
>> 
>>> If we implement vplan
>> What do you mean by `VPlan`? Are you talking about LLVM? I am working on something a little similar with `VTransform`. But I'm not sure if it is relevant. I mean, in theory you can also use the Vector API, and then it would be nice if such a vector mul were also changed into shifts when that is more profitable. Actually, maybe you should first address that: which vector mul could become a vector shift? That would be a nice stand-alone change that you could implement without regressions, right?
>> 
>> What do you think?
>
>> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.
> 
> Ok, I'll combine the patches and file one big patch, and update the test results.
> 
>> Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.
> 
> OK, I can provide test results on aarch64 Neoverse V2 and N1, AMD Genoa, and Intel SPR processors.
> 
>> What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean, in theory you can also use the Vector API, and then it would be nice if such a vector mul were also changed into shifts when that is more profitable. Actually, maybe you should first address that: which vector mul could become a vector shift? That would be a nice stand-alone change that you could implement without regressions, right?
> 
> Yes, I mean LLVM's VPlan. I noticed your related work; it's very interesting.
> 
>> it would be nice if such a vector mul were also changed into shifts when that is more profitable.
> 
> Yes.

@erifan I did some more thinking while falling asleep and waking up. This is a really interesting problem.

For `MulINode::Ideal` with patterns `var * con`, we really have these options in assembly (see the sketch after the list):
- `mul`: the general case.
- `shift` and `add`: when profitable.
- `lea`: could this be an improvement over `shift` and `add`?
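To make these options concrete, here is a minimal sketch in plain Java (the constants 5, 8 and -4 are arbitrary examples I picked, not taken from the patch). On x86, a scaled `lea` can compute `base + index*scale` for scales 1, 2, 4 and 8 in a single instruction, so the `x * 5` shape below fits one `lea eax, [rdi + rdi*4]`:

```java
// Strength-reduction identities for var * con. Java int arithmetic wraps,
// so these identities hold exactly:
static int mulBy8(int x)      { return x << 3; }         // x * 8: a single shift
static int mulBy5(int x)      { return (x << 2) + x; }   // x * 5: shift + add, or one lea on x86
static int mulByMinus4(int x) { return -(x << 2); }      // x * -4: shift + negate
```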

The issue is that different platforms have different characteristics for these instructions - we would have to see how they differ. As far as I remember, `mul` is not always available on all `ALU` ports, but `add` and `shift` should be; this affects throughput (more ports / ALUs generally means more throughput). The instructions also have different latencies. Further, I could imagine that at some point more instructions affect not just throughput but also code size: that in turn increases the IR and may at some point put pressure on the instruction cache.
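As a rough worked example (the cycle counts are assumptions for illustration, in the style of Agner Fog's instruction tables, not measured values): if `imul` has a latency of 3 cycles and `shl` / `add` 1 cycle each, then `x * 9` computed as `(x << 3) + x` has a dependent-chain latency of 2 cycles versus 3 for the `mul` - but it occupies two ALU slots instead of one, so it can win on latency-bound code and lose on throughput-bound code.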

Additionally: if your workload has other `mul`, `shift` and `add` operations mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you should take.

And then the characteristics of scalar ops may not be identical to those of vector ops.

It would be interesting to have a really solid benchmark that explores the impact of these different effects.
And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.
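As a sketch of what such a benchmark could look like (the class and method names below are mine, not from the PR, following the style of the JMH micros under `test/micro`):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

// Hypothetical JMH microbenchmark comparing a constant mul against the
// hand-written shift+add shape; 9 = 8 + 1 is an arbitrary shift+add candidate.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class MulConstant {
    @Param({"1024"})
    int size;
    int[] a, r;

    @Setup
    public void setup() {
        a = new int[size];
        r = new int[size];
        for (int i = 0; i < size; i++) a[i] = i;
    }

    @Benchmark
    public void mulConst9() {   // C2 decides: mul, or shift+add
        for (int i = 0; i < size; i++) r[i] = a[i] * 9;
    }

    @Benchmark
    public void shiftAdd9() {   // hand-written baseline
        for (int i = 0; i < size; i++) r[i] = (a[i] << 3) + a[i];
    }
}
```

Running the two shapes on each platform of interest, both scalar (with vectorization disabled) and auto-vectorized, would give exactly the kind of table I mean.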

Maybe perfect tuning is not possible. Maybe we are willing to take a `5%` regression in some cases to boost other cases by `30%`. But that is a **big maybe**: we really do not like regressions in existing code; they tend to upset people more than speedups please them, so work like this can be delicate.

Anyway, right now I don't have much time to investigate and work on this myself. So you'd have to do the work: benchmarks, explanations, etc. **But I think the `30%` speedup indicates that this work could really have potential!**

As to what to do in which sequence, here is a suggestion:
1. First work on the Vector API cases of vector multiplication - this should have no impact on other things (see the sketch after this list).
2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled the old way, but auto-vectorized code would be turned into `MulV` first, and the mul -> shift optimization for vectors from point 1 would then kick in.
3. Tackle `MulINode::Ideal` for the scalar cases after loop-opts, and see what you can do there.
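To make point 1 concrete, here is roughly the kind of Vector API shape I have in mind - a sketch assuming the incubating `jdk.incubator.vector` API, with a power-of-two constant as the simplest case:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorMulShift {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Explicit vector mul by a constant; a candidate to be lowered to a shift.
    static void mulBy8(int[] a, int[] r) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            var m = SPECIES.indexInRange(i, a.length);
            IntVector v = IntVector.fromArray(SPECIES, a, i, m);
            v.mul(8).intoArray(r, i, m);                      // could become LSHL by 3
        }
    }

    // The equivalent shift, spelled out by hand.
    static void shiftBy3(int[] a, int[] r) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            var m = SPECIES.indexInRange(i, a.length);
            IntVector v = IntVector.fromArray(SPECIES, a, i, m);
            v.lanewise(VectorOperators.LSHL, 3).intoArray(r, i, m);
        }
    }
}
```

Ideally both methods would compile to the same shift-based vector code whenever the shift is cheaper on the target CPU, and that transform would have no effect on scalar code.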

This way you can separate scalar and vector changes.
All of this really depends on very good benchmarks, run on various platforms. Good presentation would be key here. I find tables with numbers important, but a visual representation on top would be good too - it can give an easier overview of the patterns.

And please investigate when `lea` is applicable / profitable.

We may also want to get people from ARM and Intel into this discussion at some point.

That's enough for now. Let me know what you think :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2579182815

