RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

Mon Feb 10 21:25:02 UTC 2025

On Wed, 8 Jan 2025 02:06:11 GMT, erifan <duke at openjdk.org> wrote:

>> Is this an improvement on aarch64 for all implementations? What about x64?
>
> If `a*6` is in a loop and can be vectorized, there may be a big performance improvement. If it is not vectorized, there may be a small performance loss. See the test results for `a*18 = (a<<4) + (a<<1)` (the same applies to `a*6 = (a<<2) + (a<<1)`) in three different cases:
> 
> 
> Benchmark        V2-now  V2-after  Uplift  Genoa-now  Genoa-after  Uplift  Notes
> testInt18         98.90    102.94    0.96     142.48       140.75    1.01  scalar
> testInt18AddSum   68.70     48.10    1.42      26.88        16.78    1.6   vectorized
> testInt18Store    41.31     43.39    0.95      21.23        20.88    1.01  vectorized
> 
> 
> We can see that for the scalar case the conversion `a*6 => (a<<2) + (a<<1)` is profitable on aarch64. I have a follow-up patch that reimplements this pattern in the aarch64 backend; I'll file it later. On x64, however, there is no obvious performance difference whether or not the conversion is done. This is also why I left a TODO in [mulnode.cpp](https://github.com/openjdk/jdk/pull/22922/files/193dc4e5760007784cffd64ef14e0050b0be92b3#diff-b1bd52f0743843e15452764f48ff43c15dd3192a28bfb684b34149f0e964996e)
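
For readers following along, here is a minimal Java sketch of the arithmetic identity behind the conversion discussed above. It only checks that the shift-and-add forms agree with plain multiplication under Java's wrapping `int` arithmetic; it is not the C2 code in `MulINode::Ideal()`, and the class and method names (`MulByConstSketch`, `mul6`, `mul18`) are hypothetical names chosen for illustration.

```java
public class MulByConstSketch {
    // Hypothetical helper: a*6 written as (a<<2) + (a<<1), i.e. 4a + 2a.
    static int mul6(int a) {
        return (a << 2) + (a << 1);
    }

    // Hypothetical helper: a*18 written as (a<<4) + (a<<1), i.e. 16a + 2a.
    static int mul18(int a) {
        return (a << 4) + (a << 1);
    }

    public static void main(String[] args) {
        // Includes boundary values to exercise int overflow (wraparound).
        int[] samples = {0, 1, -1, 7, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            if (mul6(a) != a * 6 || mul18(a) != a * 18) {
                throw new AssertionError("mismatch for a = " + a);
            }
        }
        System.out.println("shift-add forms match a*6 and a*18 for all samples");
    }
}
```

Both forms wrap modulo 2^32, so they agree even when the product overflows. Whether one form is faster than the other is a separate, platform-dependent question, which is what the benchmark numbers above measure.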

Benchmark        V2-now  V2-after  Uplift  Genoa-now  Genoa-after  Uplift  Notes
testInt18         98.90    102.94    0.96     142.48       140.75    1.01  scalar

Ok, that would be a 4% regression on V2. It is not much, but still possibly relevant.

I think I would need to see a clear strategy that we can actually pull off. Otherwise it may be that you introduce a regression here, that then nobody gets around to fixing later.

>> Also: this is very confusing: why does the result differ depending on `AlignVector`?
>
> Without this patch, this loop does not vectorize at all because of the VPointer issue you mentioned.
> With this patch, `i*6` is not changed to `(i<<2) + (i<<1)`, so the VPointer issue is bypassed. So if `AlignVector` is false, this loop will be vectorized.
> 
>> why does the result differ depending on `AlignVector` ?
> 
> Because this loop operates on a discontinuous and unaligned array address space. If we require aligned vectors (that is, `AlignVector` is true), this loop will not be vectorized; otherwise it will be vectorized.

Right, that makes sense. I actually thought about that too after work yesterday evening.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906565583
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906561384