RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2]

Thu Feb 23 19:48:10 UTC 2023

On Tue, 21 Feb 2023 08:26:59 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> The loop that doesn't vectorize is:
>> 
>> 
>> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) {
>>     for (int i = start; i < stop; i++) {
>>         UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]);
>>     }
>> }
>> 
>> 
>> It's from a micro-benchmark in the panama
>> repo. `SuperWord::find_adjacent_refs()` prevents it from vectorizing
>> because it cannot properly align the loop and, per the comment in
>> the code:
>> 
>> 
>> // Can't allow vectorization of unaligned memory accesses with the
>> // same type since it could be overlapped accesses to the same array.
>> 
>> 
>> The test for "same type" is implemented by looking at the memory
>> operation type which in this case is overly conservative as the loop
>> above is reading and writing with long loads/stores but from and to
>> arrays of different types that can't overlap. Actually, with such
>> mismatched accesses, it's also likely an incorrect test (reads and
>> writes could go to the same array with loads/stores that use
>> different operand sizes), even though I couldn't write a test case
>> that would trigger an incorrect execution.
>> 
>> As a fix, I propose implementing the "same type" test by looking at
>> memory aliases instead.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
> 
>  - comments
>  - extra test
>  - more
>  - Merge branch 'master' into JDK-8300258
>  - review
>  - more
>  - fix & test
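
For anyone who wants to run the loop from the description without access
to `UNSAFE`, here is a rough stand-in. It is only a sketch: it swaps
`UNSAFE.putLongUnaligned` for a byte-array view `VarHandle`, so there is no
`baseOffset` term (the view index is a byte offset from element 0). The
names `ByteLongSketch`, `copyLongs` and `readLong` are made up for
illustration, and whether C2 treats this form the same way as the Unsafe
version is not something this sketch shows.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class ByteLongSketch {
    // Views a byte[] as (possibly unaligned) longs in native byte order.
    // The int coordinate is a byte index into the array.
    static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Same loop shape as testByteLong4: long-sized stores into a byte[],
    // long-sized loads from a long[]. The two arrays cannot overlap.
    static void copyLongs(byte[] dest, long[] src, int start, int stop) {
        for (int i = start; i < stop; i++) {
            LONG_VIEW.set(dest, 8 * i, src[i]);
        }
    }

    // Reads one long back from the byte[] at the given byte index.
    static long readLong(byte[] a, int byteIndex) {
        return (long) LONG_VIEW.get(a, byteIndex);
    }
}
```

Calling `copyLongs(dest, src, 0, src.length)` with `dest.length >= 8 * src.length`
fills `dest` with the native-order bytes of `src`.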

And a follow-up:

> How does it deal with index variables that might be offset?  Something like this:
>
>>  static void test_IBci(int[] a, byte[] b) {
>>    for (int i = 0; i < a.length - 1; i+=1) {
>>      a[i] = -456;
>>      b[i + 1] = -(byte)123;
>>    }
>>  }
>
>
> It's not obvious to me where that will be weeded out.

0b3       movq    [RDX + #17 + R10],XMM1    ! store vector (8 bytes)

It generates an unaligned move. On x86 that does not matter (I used only
unaligned asm instructions), but on SPARC it is a disaster (roughly 40 times
slower, since it traps):

IBci: 2382
IBvi: 65

I will need to add a check (current alignment vs. max alignment) to
find_adjacent_refs so that only aligned mem ops are vectorized on SPARC.
I want to keep unaligned mem ops on x86 since we can still win on vector
arithmetic.
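
To make the alignment problem concrete, here is a minimal sketch of the
arithmetic (this is not the C2 code). It assumes a 16-byte array header,
which is consistent with the `#17` displacement in the store above (16 + 1
for `b[i + 1]`), 8-byte object alignment, and the 8-byte vector size from
the listing. `AlignSketch` and its methods are hypothetical names.

```java
public class AlignSketch {
    static final int ARRAY_BASE = 16;   // assumed array header size in bytes
    static final int VECTOR_BYTES = 8;  // from "store vector (8 bytes)"

    // Byte offset of a[i] within the int[] object.
    static long offsetA(int i) {
        return ARRAY_BASE + 4L * i;
    }

    // Byte offset of b[i + 1] within the byte[] object.
    static long offsetB(int i) {
        return ARRAY_BASE + (i + 1L);
    }

    // True if the offset is a multiple of the vector size.
    static boolean aligned(long offset) {
        return offset % VECTOR_BYTES == 0;
    }

    // Counts iterations in [0, limit) where both accesses are vector-aligned.
    static int countBothAligned(int limit) {
        int count = 0;
        for (int i = 0; i < limit; i++) {
            if (aligned(offsetA(i)) && aligned(offsetB(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBothAligned(1024)); // prints 0
    }
}
```

Under these assumptions `a[i]` is aligned only for even `i` and `b[i + 1]`
only when `i % 8 == 7`, so no pre-loop can align both accesses at once. That
is the case the extra check would have to reject on SPARC.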

-------------

PR: https://git.openjdk.org/jdk/pull/12440