RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

Mon Feb 10 21:25:01 UTC 2025

On Tue, 7 Jan 2025 17:19:22 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 148:
>> 
>>> 146:     @IR(counts = { IRNode.MUL_I, "1" })
>>> 147:     private static int addTo6(int a) {
>>> 148:         return a + a + a + a + a + a;
>> 
>> Is this an improvement?
>
> Is this an improvement on aarch64 for all implementations? What about x64?

If `a*6` is in a loop and can be vectorized, there may be big performance improvement. If it's not vectorized, there may be some small performance loss. See the test results of `a*18 = (a<<4) + (a<<1)`, (same with `a*6 = (a<<2) + (a<<1)`) in three different cases:

Benchmark        V2-now	V2-after  Uplift Genoa-now Genoa-after Uplift  Notes
testInt18        98.90  102.94	  0.96     142.48   140.75	1.01   scalar
testInt18AddSum  68.70	48.10     1.42     26.88    16.78	1.6   vectorized
testInt18Store   41.31	43.39     0.95     21.23    20.88	1.01  vectorized

We can see that for scalar case the conversion from `a*6 => (a<<2) + (a<<1)` is profitable on aarch64, I have a follow up patch to reimplement this pattern in aarch64 backend, I'll file it later. But for x64, there is no obvious performance change whether or not to do this conversion. So this is also why I leave a TODO in [mulnode.cpp](https://github.com/openjdk/jdk/pull/22922/files/193dc4e5760007784cffd64ef14e0050b0be92b3#diff-b1bd52f0743843e15452764f48ff43c15dd3192a28bfb684b34149f0e964996e)

>> test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java line 1059:
>> 
>>> 1057:             b[i * 6 + 1] = (byte) (a[i * 6 + 1] & mask);
>>> 1058:             b[i * 6 + 2] = (byte) (a[i * 6 + 2] & mask);
>>> 1059:             b[i * 6 + 3] = (byte) (a[i * 6 + 3] & mask);
>> 
>> Why did this change?
>> 
>> Was `i * 6` not supposed to be changed to `(i << 2) + (i << 1)`? This is the reason why the current impl of VPointer is not parsing this right, but after my patch https://github.com/openjdk/jdk/pull/21926 this will be fixed because we will parse the multiple occurances of `i` properly.
>> 
>> So it looks like you are now sometimes keeping it at `i * 6` instead of splitting it. Why?
>
> Also: this is very confusing: why does the result differ depending on `AlignVector`?

Without this patch, since the VPointer issue you mentioned this loop does not vectorize at all.
With this patch, `i*6` is not changed to `(i<<2) + (i<<1)`, so the VPointer issue is bypassed. So if `AlignVector` is false, this loop will be vecotized.

> why does the result differ depending on `AlignVector` ?

Because this loop operates on a discontinuous and unaligned array address space. If we require aligned vector (that is `AlignVector` is true), this loop will not be vectorized, otherwise it will be vectorized.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906240924
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906259743