RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v58]

Fri Jan 5 08:51:38 UTC 2024

On Fri, 5 Jan 2024 08:47:09 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> @vnkozlov 
>>> Can you show assembler code for simple load and store instructions (move data from one array to another)?
>> 
>> Here is the example with a simple load -> store between two different arrays:
>> 
>> public class Test {
>>     static int RANGE = 1024*64;
>> 
>>     public static void main(String[] strArr) {
>>         int a[] = new int[RANGE];
>>         int b[] = new int[RANGE];
>>         test0(a, b);
>>     }
>> 
>>     static void test0(int[] a, int[] b) {
>>         for (int i = 0; i < RANGE; i++) {
>>             a[i] = b[i];
>>         }
>>     }
>> }
>> 
>> 
>> With `-XX:+VerifyAlignVector`:
>> `./java -XX:CompileCommand=compileonly,Test::test* -XX:+TraceSuperWord -Xcomp -XX:+PrintIdeal -XX:+AlignVector  -XX:+VerifyAlignVector -XX:CompileCommand=print,Test::test* Test.java`
>> 
>> 
>>  ;; B32: #	out( B32 B33 ) <- in( B31 B32 ) Loop( B32-B32 inner post of N1028) Freq: 4.49976
>>   0x00007fbef8bb31ec:   movslq %ebx,%r10
>>   0x00007fbef8bb31ef:   shl    $0x2,%r10                    ;*iaload {reexecute=0 rethrow=0 return_oop=0}
>>                                                             ; - Test::test0 at 13 (line 12)
>>   0x00007fbef8bb31f3:   lea    0x10(%r13,%r10,1),%r8
>>   0x00007fbef8bb31f8:   lea    0x10(%r11,%r10,1),%r10
>>   0x00007fbef8bb31fd:   test   $0x7,%r8b
>>   0x00007fbef8bb3201:   je     0x00007fbef8bb3217
>>   0x00007fbef8bb3203:   movabs $0x7fbf08c15fc8,%rdi         ;   {external_word}
>>   0x00007fbef8bb320d:   and    $0xfffffffffffffff0,%rsp
>>   0x00007fbef8bb3211:   callq  0x00007fbf085d3162           ;   {runtime_call MacroAssembler::debug64(char*, long, long*)}
>>   0x00007fbef8bb3216:   hlt    
>>   0x00007fbef8bb3217:   vmovdqu32 (%r8),%zmm0
>>   0x00007fbef8bb321d:   test   $0x7,%r10b
>>   0x00007fbef8bb3221:   je     0x00007fbef8bb3237
>>   0x00007fbef8bb3223:   movabs $0x7fbf08c15fc8,%rdi         ;   {external_word}
>>   0x00007fbef8bb322d:   and    $0xfffffffffffffff0,%rsp
>>   0x00007fbef8bb3231:   callq  0x00007fbf085d3162           ;   {runtime_call MacroAssembler::debug64(char*, long, long*)}
>>   0x00007fbef8bb3236:   hlt    
>>   0x00007fbef8bb3237:   vmovdqu32 %zmm0,(%r10)              ;*iastore {reexecute=0 rethrow=0 return_oop=0}
>>                                                             ; - Test::test0 at 14 (line 12)
>>   0x00007fbef8bb323d:   add    $0x10,%ebx                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>>                                                             ; - Test::test0 at 15 (line 11)
>>   0x00007fbef8bb3240:   cmp    %r9...
>
>> My concern is that LoadV and StoreV are defined only with memory input
> 
> 
> // Indirect Memory Operand
> operand indirect(any_RegP reg)
> %{
>   constraint(ALLOC_IN_RC(ptr_reg));
>   match(reg);
> 
>   format %{ "[$reg]" %}
>   interface(MEMORY_INTER) %{
>     base($reg);
>     index(0x4);
>     scale(0x0);
>     disp(0x0);
>   %}
> %}
> 
> 
> 
> opclass memory(indirect, indOffset8, indOffset32, indIndexOffset, indIndex,
>                indIndexScale, indPosIndexScale, indIndexScaleOffset, indPosIndexOffset, indPosIndexScaleOffset,
>                indCompressedOopOffset,
>                indirectNarrow, indOffset8Narrow, indOffset32Narrow,
>                indIndexOffsetNarrow, indIndexNarrow, indIndexScaleNarrow,
>                indIndexScaleOffsetNarrow, indPosIndexOffsetNarrow, indPosIndexScaleOffsetNarrow);
> 
> It seems that `memory` is an operand class that covers many different addressing patterns. One of them is `indirect`, which simply takes the address from a register. In our case this address was computed by a `lea`, then used in the alignment verification, and then passed on as `memory` to the load / store.

> Also, why does your assembler example test alignment twice for the same address? Maybe because the same array element is used for load and store?

Yes, exactly. I emit verification for every loadV / storeV. Since this is debug-only, and only enabled with the extra flag `-XX:+VerifyAlignVector`, I thought optimizing it was not necessary. And it seems you agree with that :)
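
For readers who do not want to decode the assembly, here is a rough Java model of what the listing above does around each vector access. It is only a sketch: the class and method names are made up for illustration, the base addresses are hypothetical, and the constants `0x10`, `0x7` and the stride of 16 are simply read off the `lea`, `test` and `add` instructions in the listing. It is not the code that C2 actually emits.

```java
public class AlignCheckSketch {
    // Read off `lea 0x10(%r13,%r10,1),%r8`: displacement added to the array base.
    static final long ARRAY_BASE_OFFSET = 0x10;
    // Read off `test $0x7,%r8b`: the low three bits of the address must be zero.
    static final long ALIGN_MASK = 0x7;

    // Models: movslq %ebx,%r10 ; shl $0x2,%r10 ; lea 0x10(base,%r10,1),addr
    static long elementAddress(long arrayBase, int index) {
        return arrayBase + ARRAY_BASE_OFFSET + ((long) index << 2);
    }

    // Models: test $0x7,addr ; je ok
    static boolean isAligned(long address) {
        return (address & ALIGN_MASK) == 0;
    }

    // Models the fall-through path (debug64 call followed by hlt) as an exception.
    static void verifyAligned(long address) {
        if (!isAligned(address)) {
            throw new IllegalStateException(
                "misaligned vector access at 0x" + Long.toHexString(address));
        }
    }

    public static void main(String[] args) {
        long a = 0x1000; // hypothetical base address of a[]
        long b = 0x2000; // hypothetical base address of b[]
        // `add $0x10,%ebx`: each iteration handles 16 ints (one zmm register).
        for (int i = 0; i < 64; i += 16) {
            verifyAligned(elementAddress(b, i)); // check before the vector load
            verifyAligned(elementAddress(a, i)); // check before the vector store
        }
        System.out.println("all vector accesses aligned");
    }
}
```

The point of the sketch is that the check is emitted once per vector memory access, which is why the listing contains two `test` / `je` sequences per iteration.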

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1442626289