RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2]
Vladimir Kozlov
kvn at openjdk.java.net
Thu Apr 28 18:57:00 UTC 2022
On Thu, 28 Apr 2022 06:33:51 GMT, Fei Gao <fgao at openjdk.org> wrote:
>> SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale
>> is negative and index starts from a constant value, SLP can't succeed in vectorizing.
>>
>>
>> private static final int SIZE = 2345;
>> private static int[] a = new int[SIZE];
>> private static int[] b = new int[SIZE];
>>
>> public static void bar() {
>> for (int i = 1000; i > 0; i--) {
>> b[SIZE - i] = a[SIZE - i];
>> }
>> }
>>
>>
>> Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as:
>> 
>> which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off.
>>
>> The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above.
>>
>> After the patch, generated loop code for above case is like below on
>> aarch64:
>>
>> LOOP: mov w10, w12
>> sxtw x12, w10
>> neg x0, x12
>> lsl x0, x0, #2
>> add x1, x17, x0
>> ldr q16, [x1, x2]
>> add x0, x18, x0
>> str q16, [x0, x2]
>> ldr q16, [x1, x13]
>> str q16, [x0, x13]
>> ldr q16, [x1, x14]
>> str q16, [x0, x14]
>> ldr q16, [x1, x15]
>> sub x12, x11, x12
>> lsl x12, x12, #2
>> add x3, x17, x12
>> str q16, [x0, x15]
>> ldr q16, [x3, x2]
>> add x12, x18, x12
>> str q16, [x12, x2]
>> ldr q16, [x1, x16]
>> str q16, [x0, x16]
>> ldr q16, [x3, x14]
>> str q16, [x12, x14]
>> ldr q16, [x3, x15]
>> str q16, [x12, x15]
>> sub w12, w10, #0x20
>> cmp w12, #0x1f
>> b.gt LOOP
>>
>>
>> This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed.
>>
>> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
>> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Add an IR testcase
>
> Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e
> - Merge branch 'master' into fg8284981
>
> Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5
> - 8284981: Support the vectorization of some counting-down loops in SLP
>
> SLP can vectorize basic counting-down or counting-up loops. But
> for the counting-down loop below, in which array index scale
> is negative and index starts from a constant value, SLP can't
> succeed in vectorizing.
>
> ```
> private static final int SIZE = 2345;
> private static int[] a = new int[SIZE];
> private static int[] b = new int[SIZE];
>
> public static void bar() {
> for (int i = 1000; i > 0; i--) {
> b[SIZE - i] = a[SIZE - i];
> }
> }
> ```
>
> Generally, it's necessary to find adjacent memory operations, i.e.
> load/store, after unrolling in SLP. Constructing SWPointers[1] for
> all memory operations is a key step to determine if these memory
> operations are adjacent. To construct a SWPointer successfully,
> SLP should first recognize the pattern of the memory address and
> normalize it. The address pattern of the memory operations in the
> case above can be visualized as:
>
> Phi
> /
> ConL ConvI2L
> \ /
> SubL ConI
> \ /
> LShiftL
>
> which is equivalent to `(N - (long) i) << 2`. SLP recursively
> resolves the address mode by SWPointer::scaled_iv_plus_offset().
> When arriving at the `SubL` node, it accepts `SubI` only and finally
> rejects the pattern of the case above[2]. In this way, SLP can't
> construct effective SWPointers for these memory operations and
> the process of vectorization breaks off.
>
> The pattern like `(N - (long) i) << 2` is formal and easy to
> resolve. We add the pattern of SubL in the patch to vectorize
> counting-down loops like the case above.
>
> After the patch, generated loop code for above case is like below on
> aarch64:
> ```
> LOOP: mov w10, w12
> sxtw x12, w10
> neg x0, x12
> lsl x0, x0, #2
> add x1, x17, x0
> ldr q16, [x1, x2]
> add x0, x18, x0
> str q16, [x0, x2]
> ldr q16, [x1, x13]
> str q16, [x0, x13]
> ldr q16, [x1, x14]
> str q16, [x0, x14]
> ldr q16, [x1, x15]
> sub x12, x11, x12
> lsl x12, x12, #2
> add x3, x17, x12
> str q16, [x0, x15]
> ldr q16, [x3, x2]
> add x12, x18, x12
> str q16, [x12, x2]
> ldr q16, [x1, x16]
> str q16, [x0, x16]
> ldr q16, [x3, x14]
> str q16, [x12, x14]
> ldr q16, [x3, x15]
> str q16, [x12, x15]
> sub w12, w10, #0x20
> cmp w12, #0x1f
> b.gt LOOP
> ```
>
> This patch also works on x86 simd machines. We tested full jtreg on both
> aarch64 and x86 platforms. All tests passed.
>
> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953
>
> Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb
tier1 passed. Please wait testing report from Tobias before integration.
-------------
Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8289
More information about the hotspot-compiler-dev
mailing list