RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()
Emanuel Peter
epeter at openjdk.org
Mon Feb 10 21:25:01 UTC 2025
On Tue, 7 Jan 2025 17:17:22 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example:
>> 1. x*8 can be optimized as x<<3.
>> 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64.
>>
>> Currently OpenJDK implemented a few such patterns in mid-end, including:
>> 1. |C| = 1<<n (n>0)
>> 2. |C| = (1<<n) - 1 (n>0)
>> 3. |C| = (1<<m) + (1<<n) (m>n, n>=0)
>>
>> The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction.
>>
>> But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns:
>> 3.1. C = (1<<n) + 1 (n>0)
>> 3.2. C = -((1<<n) + 1) (n>0)
>> 3.3. C = (1<<m) + (1<<n) (m>n, n>0)
>> 3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)
>>
>> According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.
>>
>> Before (MUL is not converted):
>>
>> mov x1, #C
>> mul x2, x1, x0
>>
>>
>> Now (MUL is converted):
>> For 3.1:
>>
>> add x2, x0, x0, lsl #n
>>
>>
>> For 3.2:
>>
>> add x2, x0, x0, lsl #n // same cost with mul if n > 4
>> neg x2, x2
>>
>>
>> For 3.3:
>>
>> lsl x1, x0, #m
>> add x2, x1, x0, lsl #n // same cost with mul if n > 4
>>
>>
>> For 3.4:
>>
>> lsl x1, x0, #m
>> add x2, x1, x0, lsl #n // same cost with mul if n > 4
>> neg x2, x2
>>
>>
>> Test results (ns/op) on Arm Neoverse V2:
>>
>> Before Now Uplift Pattern Notes
>> testInt9 103.379 60.702 1.70305756 3.1
>> testIntN33 103.231 106.825 0.96635619 3.2 n > 4
>> testIntN9 103.448 103.005 1.004300762 3.2 n <= 4
>> testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4
>> testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4
>> testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4
>> testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4
>> testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4
>> testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4
>>
>>
>> **(S1) From this point on, we should treat pattern 3 as follows:**
>> 3.1 C = (1<<n) + 1 (n>0)
>> 3.2 C = -((1<<n) + 1) (0<n<=4)
>> 3.3 C...
>
> test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 148:
>
>> 146: @IR(counts = { IRNode.MUL_I, "1" })
>> 147: private static int addTo6(int a) {
>> 148: return a + a + a + a + a + a;
>
> Is this an improvement?
Is this an improvement on aarch64 for all implementations? What about x64?
> test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java line 1059:
>
>> 1057: b[i * 6 + 1] = (byte) (a[i * 6 + 1] & mask);
>> 1058: b[i * 6 + 2] = (byte) (a[i * 6 + 2] & mask);
>> 1059: b[i * 6 + 3] = (byte) (a[i * 6 + 3] & mask);
>
> Why did this change?
>
> Was `i * 6` not supposed to be changed to `(i << 2) + (i << 1)`? This is the reason why the current impl of VPointer is not parsing this right, but after my patch https://github.com/openjdk/jdk/pull/21926 this will be fixed because we will parse the multiple occurances of `i` properly.
>
> So it looks like you are now sometimes keeping it at `i * 6` instead of splitting it. Why?
Also: this is very confusing: why does the result differ depending on `AlignVector`?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905810588
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905811309
More information about the hotspot-compiler-dev
mailing list