RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

erifan duke at openjdk.org
Mon Feb 10 21:25:00 UTC 2025


Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB instructions, since these instructions generally have lower latency and higher throughput than MUL on most architectures. For example:
1. x*8 can be optimized into x<<3.
2. x*9 can be optimized into x + (x<<3), which can be lowered to a single SHIFT-ADD instruction (an ADD combined with a LEFT SHIFT) on some architectures, such as aarch64 and x86_64.
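
As a quick sanity check (plain Java, not part of the patch), both identities hold for all int values, since multiplication and shifts wrap identically modulo 2^32:

public class MulIdentity {
    public static void main(String[] args) {
        int x = 12345;
        System.out.println(x * 8 == (x << 3));      // true: x*8 == x<<3
        System.out.println(x * 9 == x + (x << 3));  // true: x*9 == x + (x<<3)
    }
}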

Currently OpenJDK implements a few such patterns in the mid-end, including:
1. |C| = 1<<n                (n>0)
2. |C| = (1<<n) - 1          (n>0)
3. |C| = (1<<m) + (1<<n)     (m>n, n>=0)
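
For illustration only, these three shapes can be recognized with simple bit tricks; a minimal Java sketch (the class and method names are mine, the actual matching in MulINode::Ideal is C++):

public class MulPatterns {
    // pattern 1: |C| = 1<<n, n>0 (a power of two)
    static boolean isPow2(long absC) {
        return absC > 1 && (absC & (absC - 1)) == 0;
    }
    // pattern 2: |C| = (1<<n)-1, n>0 (a run of ones from bit 0)
    static boolean isPow2Minus1(long absC) {
        return absC > 0 && (absC & (absC + 1)) == 0;
    }
    // pattern 3: |C| = (1<<m)+(1<<n), m>n>=0 (exactly two bits set)
    static boolean isSumOfTwoPow2(long absC) {
        return Long.bitCount(absC) == 2;
    }
    public static void main(String[] args) {
        // 8 = 1<<3, 7 = (1<<3)-1, 9 = (1<<3)+(1<<0)
        System.out.println(isPow2(8) + " " + isPow2Minus1(7) + " " + isSumOfTwoPow2(9)); // true true true
    }
}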

The first two are fine, because on most architectures they are lowered to a single ADD/SUB/SHIFT instruction.

But the third pattern does not always perform well on some architectures, such as aarch64. It can be split into the following sub-patterns:
3.1. C = (1<<n) + 1          (n>0)
3.2. C = -((1<<n) + 1)       (n>0)
3.3. C = (1<<m) + (1<<n)     (m>n, n>0)
3.4. C = -((1<<m) + (1<<n))  (m>n, n>0)

According to the Arm optimization guide, if the shift amount is greater than 4, a shifted ADD has the same latency and throughput as MUL, so in that case converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.

Before (MUL is not converted):

  mov x1, #C
  mul x2, x1, x0


Now (MUL is converted):
  For 3.1:

    add x2, x0, x0, lsl #n


  For 3.2:

    add x2, x0, x0, lsl #n  // same cost as mul if n > 4
    neg x2, x2


  For 3.3:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n  // same cost as mul if n > 4


  For 3.4:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n  // same cost as mul if n > 4
    neg x2, x2


Test results (ns/op) on Arm Neoverse V2:

Benchmark   Before   Now      Uplift       Pattern  Notes
testInt9    103.379  60.702   1.70305756   3.1
testIntN33  103.231  106.825  0.96635619   3.2      n > 4
testIntN9   103.448  103.005  1.004300762  3.2      n <= 4
testInt18   103.354  99.271   1.041129837  3.3      m <= 4, n <= 4
testInt36   103.396  99.186   1.042445506  3.3      m > 4, n <= 4
testInt96   103.337  105.416  0.980278136  3.3      m > 4, n > 4
testIntN18  103.333  139.258  0.742025593  3.4      m <= 4, n <= 4
testIntN36  103.208  139.132  0.741799155  3.4      m > 4, n <= 4
testIntN96  103.367  139.471  0.74113615   3.4      m > 4, n > 4


**(S1) Judging from these results, we should treat pattern 3 as follows:**
3.1  C = (1<<n) + 1          (n>0)
3.2  C = -((1<<n) + 1)       (0<n<=4)
3.3  C = (1<<m) + (1<<n)     (m>n, 0<n<=4)
3.4  C = -((1<<m) + (1<<n))  (disable)

Since this conversion is implemented in the mid-end, it also affects other optimizations, such as auto-vectorization. Assume the following loop, which can be vectorized.
Vector-A:

for (int i=0; i<len; i++) {
  sum += a[i] * C;
}


Before:

  movi    v19.4s, #C  // this will be hoisted out of the loop
  mla     v16.4s, v17.4s, v19.4s


After:
  For 3.1:

    shl     v19.4s, v17.4s, #n
    add     v17.4s, v19.4s, v17.4s
    add     v16.4s, v16.4s, v17.4s


  For 3.2:

    (add    w11, w11, w11, lsl #n
    sub     w11, w12, w11) * 4  // the sequence repeats 4 times, once per int


  For 3.3:

    shl     v18.4s, v17.4s, #m
    shl     v19.4s, v17.4s, #n
    add     v18.4s, v19.4s, v18.4s
    add     v16.4s, v16.4s, v18.4s


  For 3.4:

    (lsl    w12, w4, #m
    add     w11, w12, w4, lsl #n
    sub     w13, w13, w11) * 4  // the sequence repeats 4 times, once per int


The code generated before is simpler and performs better:

Benchmark         Before  Now      Uplift       Pattern
testInt9AddSum    47.958  63.696   0.752920121  3.1
testIntN33AddSum  48.013  147.834  0.324776438  3.2
testIntN9AddSum   48.026  149.149  0.322000148  3.2
testInt18AddSum   47.971  69.393   0.691294511  3.3
testInt36AddSum   47.98   69.395   0.69140428   3.3
testInt96AddSum   47.992  69.453   0.690999669  3.3
testIntN18AddSum  48.014  157.132  0.305564748  3.4
testIntN36AddSum  48.02   157.094  0.305676856  3.4
testIntN96AddSum  48.032  153.642  0.312622851  3.4
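
For reference, the AddSum benchmarks wrap the Vector-A loop in JMH; a minimal sketch of the likely shape (annotations, array size and setup are my assumptions, not the actual benchmark source):

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MulConstantBench {
    int[] a = new int[1024];   // size is an assumption

    @Setup
    public void setup() {
        for (int i = 0; i < a.length; i++) a[i] = i;
    }

    @Benchmark
    public int testInt9AddSum() {      // pattern 3.1: 9 = (1<<3) + 1
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * 9;           // the Vector-A loop with C = 9
        }
        return sum;
    }
}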


**(S2) Judging from these results, we should disable pattern 3 entirely.**

But there are other cases, for example:
Vector-B:

for (int i=0; i<100000; i++) {
  a[i] = a[i] * C;
}

Test results:

Benchmark        Before  Now     Uplift       Pattern
testInt9Store    43.392  33.338  1.301577779  3.1
testIntN33Store  43.365  75.993  0.570644665  3.2
testIntN9Store   43.5    75.452  0.576525473  3.2
testInt18Store   43.442  41.847  1.038115038  3.3
testInt36Store   43.369  41.843  1.03646966   3.3
testInt96Store   43.389  41.931  1.03477141   3.3
testIntN18Store  43.372  57.909  0.748968209  3.4
testIntN36Store  43.373  57.042  0.760369552  3.4
testIntN96Store  43.405  58.145  0.746495829  3.4


**(S3) Judging from these results, we should treat pattern 3 as follows:**
3.1  C = (1<<n) + 1          (n>0)
3.2  C = -((1<<n) + 1)       (disable)
3.3  C = (1<<m) + (1<<n)     (m>n, n>0)
3.4  C = -((1<<m) + (1<<n))  (disable)

Combining S1, S2 and S3, we get:
Pattern  S1                   S2               S3
3.1      (n>0, 1.7)           (disable, 0.75)  (n>0, 1.3)
3.2      (0<n<=4, 1.0)        (disable, 0.32)  (disable, 0.57)
3.3      (m>n, 0<n<=4, 1.04)  (disable, 0.69)  (m>n, n>0, 1.03)
3.4      (disable, 0.74)      (disable, 0.30)  (disable, 0.74)

For 3.1, it is similar to pattern 2 and is usually lowered to a single instruction, so we tend to keep it in the mid-end.
For 3.2, we tend to disable it in the mid-end, and apply the S1 rules in the back-end where profitable.
For 3.3, although S3 shows a 3% performance gain, S2 shows a 31% regression, so we tend to disable it in the mid-end and apply the S1 rules in the back-end.
For 3.4, we shouldn't do this optimization anywhere.
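
Putting S1, S2 and S3 together, the resulting mid-end policy can be sketched as follows (a Java-flavored illustration of the decision only; the actual change is C++ in MulINode::Ideal):

public class MulPolicy {
    // Illustrative summary of the mid-end policy after this patch.
    static String mulByConstant(int C) {
        long abs = Math.abs((long) C);
        if (abs <= 1) return "trivial";                    // 0, 1, -1 are handled elsewhere
        if ((abs & (abs - 1)) == 0) return "shift";        // pattern 1: |C| = 1<<n
        if ((abs & (abs + 1)) == 0) return "shift + sub";  // pattern 2: |C| = (1<<n)-1
        if (C > 0 && Long.bitCount(abs) == 2 && (abs & 1) == 1)
            return "shift-add";                            // pattern 3.1: C = (1<<n)+1, kept
        return "keep mul";                                 // 3.2/3.3/3.4: left to the back-end
    }
    public static void main(String[] args) {
        System.out.println(mulByConstant(9));   // shift-add (3.1)
        System.out.println(mulByConstant(-9));  // keep mul  (3.2, now disabled)
        System.out.println(mulByConstant(36));  // keep mul  (3.3, now disabled)
    }
}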

In theory, auto-vectorization should be able to generate the best vectorized code, and cases that cannot be vectorized would be converted into more optimal scalar instructions in the architecture back-end (this is what GCC and LLVM do). However, we currently do not have a cost model or VPlan, and the results of auto-vectorization are significantly affected by its input. Therefore, this patch turns off patterns 3.2, 3.3 and 3.4 in the mid-end; where profitable, these patterns can then be implemented in the back-end. If we implement a cost model and VPlan in the future, it would be best to move all patterns to the back-end, and this patch does not conflict with that direction.

I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines; no noticeable performance degradation was seen on any of them.

Here are the test results on an Arm V2 and an AMD Genoa machine:

Benchmark	  V2-now	V2-after	Uplift	Genoa-now	Genoa-after	Uplift	Pattern	Notes
testInt8	60.36989	60.276736	1	116.768294	116.772547	0.99	1
testInt8AddSum	63.658064	63.797732	0.99	16.04973	16.051491	0.99	1
testInt8Store	38.829618	39.054129	0.99	19.857453	20.006321	0.99	1
testIntN8	59.99655	60.150053	0.99	132.269926	132.252473	1	1
testIntN8AddSum	145.678098	146.181549	0.99	158.546226	158.806476	0.99	1
testIntN8Store	32.802445	32.897907	0.99	19.047873	19.065941	0.99	1
testInt7	98.978213	99.176574	0.99	114.07026	113.08989	1	2
testInt7AddSum	62.675636	62.310799	1	23.370851	20.971655	1.11	2
testInt7Store	32.850828	32.923315	0.99	23.884952	23.628681	1.01	2
testIntN7	60.27949	60.668158	0.99	174.224893	174.102295	1	2
testIntN7AddSum	62.746696	62.288476	1	20.93192	20.964557	0.99	2
testIntN7Store	32.812906	32.851355	0.99	23.810024	23.526074	1.01	2
testInt9	60.820402	60.331938	1	108.850777	108.846161	1	3.1
testInt9AddSum	62.24679	62.374637	0.99	20.698749	20.741137	0.99	3.1
testInt9Store	32.871723	32.912065	0.99	19.055537	19.080735	0.99	3.1
testIntN33	106.517618	103.450746	1.02	153.894345	140.641135	1.09	3.2	n > 4
testIntN33AddSum	147.589815	47.911612	3.08	153.851885	17.008453	9.04	3.2
testIntN33Store	75.434513	43.473053	1.73	26.612181	20.436323	1.3	3.2
testIntN9	102.173268	103.70682	0.98	155.858169	140.718967	1.1	3.2	n <= 4
testIntN9AddSum	148.724952	47.963305	3.1	186.902111	20.249414	9.23	3.2
testIntN9Store	74.783788	43.339188	1.72	20.150159	20.888448	0.96	3.2
testInt18	98.905625	102.942092	0.96	142.480636	140.748778	1.01	3.3	m <= 4, n <= 4
testInt18AddSum	68.695585	48.103536	1.42	26.88524	16.77886	1.6	3.3
testInt18Store	41.307909	43.385183	0.95	21.233238	20.875026	1.01	3.3
testInt36	99.039742	103.714745	0.95	142.265806	142.334039	0.99	3.3	m > 4, n <= 4
testInt36AddSum	68.736756	47.952189	1.43	26.868362	17.030035	1.57	3.3
testInt36Store	41.403698	43.414093	0.95	21.225454	20.52266	1.03	3.3
testInt96	105.00287	103.528144	1.01	237.649526	140.643255	1.68	3.3	m > 4, n > 4
testInt96AddSum	68.481133	48.04549	1.42	26.877407	16.918209	1.58	3.3
testInt96Store	41.276292	43.512994	0.94	23.456117	20.540181	1.14	3.3
testIntN18	138.629044	103.269657	1.34	210.315628	140.716818	1.49	3.4	m <= 4, n <= 4
testIntN18AddSum	156.635652	48.003989	3.26	215.807135	16.917665	12.75	3.4
testIntN18Store	57.584487	43.410415	1.32	26.819827	20.707778	1.29	3.4
testIntN36	139.068861	103.766774	1.34	209.522432	140.720322	1.48	3.4	m > 4, n <= 4
testIntN36AddSum	156.36928	48.027779	3.25	215.705842	16.893192	12.76	3.4
testIntN36Store	57.715418	43.493958	1.32	21.651252	20.676877	1.04	3.4
testIntN96	139.151761	103.453665	1.34	269.254161	140.753499	1.91	3.4	m > 4, n > 4
testIntN96AddSum	153.123557	48.110524	3.18	263.262635	17.011144	15.47	3.4
testIntN96Store	57.793179	43.47574	1.32	24.444592	20.530219	1.19	3.4


Limitations:
1. This patch only analyzes two vector cases; there may be other vector cases that regress with this patch.
2. This patch does not implement the disabled patterns in the back-end; I will propose a follow-up patch to implement them in the aarch64 back-end.
3. This patch does not handle the long type, because different architectures have different auto-vectorization support for long, resulting in very different performance, and it is difficult to find a solution that does not introduce significant performance degradation.

-------------

Commit messages:
 - 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()

Changes: https://git.openjdk.org/jdk/pull/22922/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22922&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8346964
  Stats: 472 lines in 5 files changed: 430 ins; 15 del; 27 mod
  Patch: https://git.openjdk.org/jdk/pull/22922.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922

PR: https://git.openjdk.org/jdk/pull/22922

