RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v58]

Thu Jan 4 07:00:51 UTC 2024

On Wed, 3 Jan 2024 19:50:49 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> src/hotspot/share/opto/compile.cpp line 3713:
>> 
>>> 3711:         // to ObjectAlignmentInBytes. Hence, even if multiple arrays are accessed in
>>> 3712:         // a loop we can expect at least the following alignment:
>>> 3713:         jlong guaranteed_alignment = MIN2(vector_width, (jlong)ObjectAlignmentInBytes);
>> 
>> This is more relaxed check than the actual alignment required. As I understand it is because it checks only base address of array and not actually memory address to which vector instruction is accessed (which is (base,index,offset)).
>> It is useful but does not guarantee correct alignment of vector access instructions.
>> 
>> Consider using `lea` instruction on x86 to load memory address into register and check it.
>
> May be hack only `loadV` and `storeV` instructions in .ad file to use `lea` and do the check.

I don't understand this comment.
The `LoadVector` and `StoreVector` nodes both have a `MemNode::Address` input, which I think is the memory address.
The address itself usually consists of `AddP` nodes, which do the (base, index, offset) computation. These nodes can later be folded into the load/store itself, or be computed with a `lea`.
I simply take the address value, check it for alignment and pass it on to the load/store.
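
To make the idea concrete, here is a minimal Java model of such a check. It is only a sketch: `isAligned` and the sample addresses are hypothetical, and the real check is emitted as machine code by the compiler, not written in Java. It assumes the required alignment is a power of two, so that masking the low bits is enough.

```java
public class AlignmentCheckSketch {
    // Hypothetical helper: true if `address` is a multiple of `alignment`.
    // Assumes `alignment` is a power of two, so (alignment - 1) is a mask
    // covering exactly the low bits that must be zero.
    static boolean isAligned(long address, long alignment) {
        return (address & (alignment - 1)) == 0;
    }

    public static void main(String[] args) {
        long aligned = 0x7f0000001050L;    // made-up address, low 3 bits are zero
        long misaligned = 0x7f0000001054L; // made-up address, off by 4 bytes

        System.out.println(isAligned(aligned, 8));
        System.out.println(isAligned(misaligned, 8));
    }
}
```

With an alignment of 8 the mask is `0x7`, which is the constant that shows up in the `test` instructions in the listing further down.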

Take this example:

public class Test {
    static int RANGE = 1024*64;

    public static void main(String[] strArr) {
        int a[] = new int[RANGE];
        test0(a);
    }

    static void test0(int[] a) {
        for (int i = 0; i < RANGE; i++) {
            a[i]++;
        }
    }
}

`./java -XX:CompileCommand=compileonly,Test::test* -XX:+TraceSuperWord -Xcomp -XX:+PrintIdeal -XX:+AlignVector  -XX:+VerifyAlignVector -XX:CompileCommand=print,Test::test* Test.java`

This looks like the main loop:

 ;; B22: #	out( B22 B23 ) <- in( B21 B22 ) Loop( B22-B22 inner post of N743) Freq: 4.49988
  0x00007f83c8bb2f68:   lea    0x10(%rbp,%rbx,4),%r10       ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - Test::test0 at 12 (line 11)
  0x00007f83c8bb2f6d:   mov    %r10,%r8
  0x00007f83c8bb2f70:   test   $0x7,%r8b
  0x00007f83c8bb2f74:   je     0x00007f83c8bb2f8a
  0x00007f83c8bb2f76:   movabs $0x7f83d77e2fc8,%rdi         ;   {external_word}
  0x00007f83c8bb2f80:   and    $0xfffffffffffffff0,%rsp
  0x00007f83c8bb2f84:   callq  0x00007f83d71a0162           ;   {runtime_call MacroAssembler::debug64(char*, long, long*)}
  0x00007f83c8bb2f89:   hlt    
  0x00007f83c8bb2f8a:   test   $0x7,%r10b
  0x00007f83c8bb2f8e:   je     0x00007f83c8bb2fa4
  0x00007f83c8bb2f90:   movabs $0x7f83d77e2fc8,%rdi         ;   {external_word}
  0x00007f83c8bb2f9a:   and    $0xfffffffffffffff0,%rsp
  0x00007f83c8bb2f9e:   callq  0x00007f83d71a0162           ;   {runtime_call MacroAssembler::debug64(char*, long, long*)}
  0x00007f83c8bb2fa3:   hlt    
  0x00007f83c8bb2fa4:   vpaddd (%r10),%zmm5,%zmm0
  0x00007f83c8bb2faa:   vmovdqu32 %zmm0,(%r8)               ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - Test::test0 at 15 (line 11)
  0x00007f83c8bb2fb0:   add    $0x10,%ebx                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - Test::test0 at 16 (line 10)
  0x00007f83c8bb2fb3:   cmp    %r11d,%ebx
  0x00007f83c8bb2fb6:   jl     0x00007f83c8bb2f68           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - Test::test0 at 6 (line 10)

What I see: `lea` computes the address and stores it in register `r10`. The value is then moved to `r8`, and `test $0x7,%r8b` checks it for 8-byte alignment. We do the same check again with `r10b`, since the load and the store use the same address. Then we load/store directly with those register values:

vpaddd (%r10),%zmm5,%zmm0
vmovdqu32 %zmm0,(%r8)
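
The arithmetic in the listing can also be replayed in plain Java. This is a rough sketch under assumptions, not a description of what C2 does internally: I read `%rbp` as the array base, `%rbx` as the index `i`, and the displacement `0x10` as the offset of the first element. The base address below is made up, and `lea` and `passesCheck` are hypothetical helper names.

```java
public class MainLoopAddressSketch {
    static final int SCALE = 4;           // scale in `lea 0x10(%rbp,%rbx,4)`: one int is 4 bytes
    static final int DISPLACEMENT = 0x10; // displacement in the same `lea` (assumed: offset of a[0])
    static final int STRIDE = 0x10;       // `add $0x10,%ebx`: 16 ints per iteration

    // Models `lea 0x10(%rbp,%rbx,4),%r10`: base + index * 4 + 0x10.
    static long lea(long base, long index) {
        return base + index * SCALE + DISPLACEMENT;
    }

    // Models `test $0x7,%r10b` followed by `je`: passes if the low 3 bits are zero.
    static boolean passesCheck(long address) {
        return (address & 0x7) == 0;
    }

    public static void main(String[] args) {
        long base = 0x7f0000001238L; // hypothetical 8-byte aligned array base
        for (long i = 0; i < 4 * STRIDE; i += STRIDE) {
            long address = lea(base, i);
            System.out.printf("i=%d address=0x%x passes=%b%n",
                              i, address, passesCheck(address));
        }
    }
}
```

Since each iteration advances the index by 16 ints, the address moves by 64 bytes, so its low bits never change: if the first address passes the 8-byte check, every later iteration passes it too.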

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1441398603