RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()
Emanuel Peter
epeter at openjdk.org
Mon Feb 10 21:25:01 UTC 2025
On Mon, 6 Jan 2025 07:55:39 GMT, erifan <duke at openjdk.org> wrote:
> Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB instructions, since these instructions generally have lower latency and higher throughput than MUL on most architectures. For example:
> 1. x*8 can be optimized as x<<3.
> 2. x*9 can be optimized as x + (x<<3), and x + (x<<3) can be lowered to one SHIFT-ADD instruction (an ADD combined with a LEFT SHIFT) on some architectures, like aarch64 and x86_64.
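>
> An illustrative Java sketch of these identities (they hold for every int value, since shift/add agrees with multiplication under Java's wrap-around arithmetic; this is not code from the PR):
>
> static boolean basicIdentitiesHold(int x) {
>     return x * 8 == (x << 3)          // 1. x*8 -> x << 3
>         && x * 9 == x + (x << 3);     // 2. x*9 -> x + (x << 3), one shift-add
> }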
>
> Currently OpenJDK implements a few such patterns in the mid-end, including:
> 1. |C| = 1<<n (n>0)
> 2. |C| = (1<<n) - 1 (n>0)
> 3. |C| = (1<<m) + (1<<n) (m>n, n>=0)
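>
> For concreteness, patterns 2 and 3 expand as in this Java sketch (illustrative constants, not taken from the PR):
>
> static boolean existingPatternsHold(int x) {
>     return x * 15 == (x << 4) - x             // pattern 2: |C| = (1<<4) - 1
>         && x * 36 == (x << 5) + (x << 2);     // pattern 3: |C| = (1<<5) + (1<<2)
> }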
>
> The first two are fine, because on most architectures they are lowered to only one ADD/SUB/SHIFT instruction.
>
> But the third pattern does not always perform well on some architectures, such as aarch64. It can be split into the following sub-patterns:
> 3.1. C = (1<<n) + 1 (n>0)
> 3.2. C = -((1<<n) + 1) (n>0)
> 3.3. C = (1<<m) + (1<<n) (m>n, n>0)
> 3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)
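>
> With concrete constants, the sub-patterns expand like this (illustrative Java; the negated cases just add a final negation):
>
> static boolean subPatternsHold(int x) {
>     return x *   9 ==   x + (x << 3)            // 3.1: C =  (1<<3) + 1
>         && x *  -9 == -(x + (x << 3))           // 3.2: C = -((1<<3) + 1)
>         && x *  18 ==   (x << 4) + (x << 1)     // 3.3: C =  (1<<4) + (1<<1)
>         && x * -18 == -((x << 4) + (x << 1));   // 3.4: C = -((1<<4) + (1<<1))
> }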
>
> According to the Arm optimization guide, if the shift amount is > 4, the latency and throughput of the shifted ADD instruction are the same as those of MUL, so in that case converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.
>
> Before (MUL is not converted):
>
> mov x1, #C
> mul x2, x1, x0
>
>
> Now (MUL is converted):
> For 3.1:
>
> add x2, x0, x0, lsl #n
>
>
> For 3.2:
>
> add x2, x0, x0, lsl #n // same cost as mul if n > 4
> neg x2, x2
>
>
> For 3.3:
>
> lsl x1, x0, #m
> add x2, x1, x0, lsl #n // same cost as mul if n > 4
>
>
> For 3.4:
>
> lsl x1, x0, #m
> add x2, x1, x0, lsl #n // same cost as mul if n > 4
> neg x2, x2
>
>
> Test results (ns/op) on Arm Neoverse V2:
>
> Benchmark    Before    Now       Uplift       Pattern  Notes
> testInt9     103.379    60.702   1.70305756   3.1
> testIntN33   103.231   106.825   0.96635619   3.2      n > 4
> testIntN9    103.448   103.005   1.004300762  3.2      n <= 4
> testInt18    103.354    99.271   1.041129837  3.3      m <= 4, n <= 4
> testInt36    103.396    99.186   1.042445506  3.3      m > 4, n <= 4
> testInt96    103.337   105.416   0.980278136  3.3      m > 4, n > 4
> testIntN18   103.333   139.258   0.742025593  3.4      m <= 4, n <= 4
> testIntN36   103.208   139.132   0.741799155  3.4      m > 4, n <= 4
> testIntN96   103.367   139.471   0.74113615   3.4      m > 4, n > 4
>
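> The benchmark kernel behind these numbers is presumably shaped like the JMH sketch below (class and field names here are my assumptions; only the multiply-by-constant loop matters):
>
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.State;
> import org.openjdk.jmh.infra.Blackhole;
>
> @State(Scope.Thread)
> public class MulConstBench {
>     int[] a   = new int[1024];
>     int[] res = new int[1024];
>
>     @Benchmark
>     public void testInt9(Blackhole bh) {
>         for (int i = 0; i < a.length; i++) {
>             res[i] = a[i] * 9;   // C = 9, pattern 3.1
>         }
>         bh.consume(res);
>     }
> }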
>
> **(S1) Based on these results, we should treat pattern 3 as follows:**
> 3.1 C = (1<<n) + 1 (n>0)
> 3.2 C = -((1<<n) + 1) (0<n<=4)
> 3.3 C = (1<<m) + (1<<n) (m>n, 0<n<=4)
> 3.4 C = -((1<<m) + (1<<n)) (disable)
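>
> Read as a predicate, (S1) boils down to something like this hypothetical Java sketch (not the PR's code), with |C| normalized to (1<<m) + (1<<n), m > n >= 0, so the single shift amount of 3.1/3.2 appears as m here:
>
> static boolean worthConverting(boolean negated, int m, int n) {
>     if (n == 0) {                    // |C| = (1<<m) + 1
>         return !negated || m <= 4;   // 3.1: always; 3.2: only if m <= 4
>     }
>     return !negated && n <= 4;       // 3.3: only if n <= 4; 3.4: disabled
> }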
>
> Since this conversion is implemented in the mid-end, it impacts...
Hi @erifan
I have some first questions / comments. I only scanned through quickly.
My biggest question:
You only mention aarch64. But we would need to know that your changes also work well on x64.
Also:
Can you summarize whether your changes are only about the performance of vectorized code, or whether they also affect scalar code?
src/hotspot/share/opto/mulnode.cpp line 253:
> 251: }
> 252:
> 253: // TODO: abs_con = (1<<m)+(1<<n) and con = -((1<<n)+1) was supported
Is this `TODO` here on purpose?
src/hotspot/share/opto/mulnode.cpp line 263:
> 261: //
> 262: // But if it's not vectorizable, maybe it's profitable to do the conversion on
> 263: // some architectures, support it in backends if it's worthwhile.
If this is only about vectorization: we could consider delaying the `Ideal` optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize.
src/hotspot/share/opto/mulnode.cpp line 273:
> 271: new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1))));
> 272: res = new AddINode(in(1), n1);
> 273: } else if (is_power_of_2(abs_con + 1)) {
So now you only check for `power_of_2 +- 1`, right? But before, we also looked at patterns with two bits set, such as `64 + 8`.
You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64.
test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 148:
> 146: @IR(counts = { IRNode.MUL_I, "1" })
> 147: private static int addTo6(int a) {
> 148: return a + a + a + a + a + a;
Is this an improvement?
test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java line 1059:
> 1057: b[i * 6 + 1] = (byte) (a[i * 6 + 1] & mask);
> 1058: b[i * 6 + 2] = (byte) (a[i * 6 + 2] & mask);
> 1059: b[i * 6 + 3] = (byte) (a[i * 6 + 3] & mask);
Why did this change?
Was `i * 6` not supposed to be changed to `(i << 2) + (i << 1)`? That is the reason why the current implementation of VPointer does not parse this right, but after my patch https://github.com/openjdk/jdk/pull/21926 this will be fixed, because we will parse the multiple occurrences of `i` properly.
So it looks like you are now sometimes keeping it at `i * 6` instead of splitting it. Why?
-------------
PR Review: https://git.openjdk.org/jdk/pull/22922#pullrequestreview-2534960901
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905807007
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905813470
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905816741
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905808320
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905802482