RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2]
Vladimir Kozlov
kvn at openjdk.org
Thu Feb 23 19:45:08 UTC 2023
On Tue, 21 Feb 2023 08:26:59 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>> The loop that doesn't vectorize is:
>>
>>
>>     public static void testByteLong4(byte[] dest, long[] src, int start, int stop) {
>>         for (int i = start; i < stop; i++) {
>>             UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]);
>>         }
>>     }
>>
>>
>> It's from a micro-benchmark in the panama
>> repo. `SuperWord::find_adjacent_refs()` prevents it from vectorizing
>> because it finds it cannot properly align the loop and, from the
>> comment in the code, that:
>>
>>
>> // Can't allow vectorization of unaligned memory accesses with the
>> // same type since it could be overlapped accesses to the same array.
>>
>>
>> The test for "same type" is implemented by looking at the memory
>> operation type which in this case is overly conservative as the loop
>> above is reading and writing with long loads/stores but from and to
>> arrays of different types that can't overlap. Actually, with such
>> mismatched accesses, it's also likely an incorrect test (reading and
>> writing could be to the same array with loads/stores that use
>> different operand sizes), even though I couldn't write a test case that
>> would trigger an incorrect execution.
>>
>> As a fix, I propose implementing the "same type" test by looking at
>> memory aliases instead.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
>
> - comments
> - extra test
> - more
> - Merge branch 'master' into JDK-8300258
> - review
> - more
> - fix & test
Yes, this code comes from JDK-7119644, which improved superword and also took into account SPARC, where we did not have unaligned memory vectors. From the review of those changes:
New code is added to vectorize operations which have different basic types:
    static void test_IBci(int[] a, byte[] b) {
        for (int i = 0; i < a.length; i+=1) {
            a[i] = -456;
            b[i] = -(byte)123;
        }
    }
0a0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N90 Freq: 9340.37
0a0 movslq R10, R11 # i2l
0a3 movdqu [RSI + #16 + R10 << #2],XMM0 ! store vector (16 bytes)
0aa movq [RDX + #16 + R10],XMM1 ! store vector (8 bytes)
0b1 movslq R10, R11 # i2l
0b4 movdqu [RSI + #32 + R10 << #2],XMM0 ! store vector (16 bytes)
0bb addl R11, #8 # int
0bf cmpl R11, R8
0c2 jl,s B12 # loop end P=0.999893 C=46701.000000
The previous code vectorized only one type in such a case. The new code collects only related
memory operations during one iteration (as before these changes). It then removes these operations
from the memops list and tries to collect other related mem ops. Such vectors need different loop
index alignments since the vector sizes are different. The new code sets the maximum loop index
alignment. Max alignment also works for the smaller sizes since all sizes are powers of 2.
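
To illustrate the last point, here is a minimal Java sketch (not HotSpot code) of why aligning to the largest vector size also aligns the smaller ones. The class and the helpers `maxAlignment` and `isAligned` are hypothetical names made up for this example; the widths 16 and 8 are taken from the two store vectors in the listing above.

```java
class MaxAlignment {
    // Hypothetical helper: returns the largest vector width in bytes.
    // Every width must be a power of 2, as assumed in the text above.
    static int maxAlignment(int[] vectorWidthsInBytes) {
        int max = 1;
        for (int w : vectorWidthsInBytes) {
            if (w <= 0 || Integer.bitCount(w) != 1) {
                throw new IllegalArgumentException("not a power of 2: " + w);
            }
            max = Math.max(max, w);
        }
        return max;
    }

    // An offset is aligned to a power-of-2 width when its low bits are zero.
    static boolean isAligned(long offset, int width) {
        return (offset & (width - 1)) == 0;
    }

    public static void main(String[] args) {
        int[] widths = {16, 8}; // int store vector and byte store vector
        int max = maxAlignment(widths);
        // Any offset aligned to the maximum is aligned to each smaller width,
        // because a smaller power of 2 always divides a larger one.
        for (long offset = 0; offset < 256; offset += max) {
            for (int w : widths) {
                if (!isAligned(offset, w)) {
                    throw new AssertionError("offset " + offset + " not aligned to " + w);
                }
            }
        }
        System.out.println("max alignment = " + max);
    }
}
```

The reverse does not hold: an offset aligned to 8 bytes (for example 8) is not necessarily aligned to 16, which is why the maximum is the one to choose.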
-------------
PR: https://git.openjdk.org/jdk/pull/12440