Problem with auto-vectorized code in AVX2

Mon Jul 16 20:53:24 UTC 2018

Hi Vladimir,
I'll take a look at this. 

Thanks,
Shravya.

-----Original Message-----
From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com] 
Sent: Friday, July 13, 2018 11:37 AM
To: panama-dev at openjdk.java.net; Rukmannagari, Shravya <shravya.rukmannagari at intel.com>; Deshpande, Vivek R <vivek.r.deshpande at intel.com>
Cc: Adam Petcher <adam.petcher at oracle.com>
Subject: Re: Problem with auto-vectorized code in AVX2

I haven't dug into the details, but with -XX:UseAVX=1 a 2-element long vector and vmul2L_reg_avx are used, and the test works fine.

The only difference between vmul2L_reg_avx and vmul4L_reg_avx is vector_len.
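
To see why vector_len alone could matter, the sequence in the vmul4L_reg_avx rule quoted below can be modeled in scalar Java. This is only a sketch under stated assumptions, not a confirmed diagnosis: it assumes the 256-bit forms of vpshufd and vphaddd operate within each 128-bit lane, and that vpmovzxdq widens only the low dwords of its source. The class and method names (MulVLLaneSketch, toDwords, simulate) are hypothetical and appear nowhere in HotSpot.

```java
import java.util.Arrays;

class MulVLLaneSketch {
    // Split each long into two 32-bit dwords, low dword first.
    static int[] toDwords(long[] v) {
        int[] d = new int[2 * v.length];
        for (int i = 0; i < v.length; i++) {
            d[2 * i] = (int) v[i];
            d[2 * i + 1] = (int) (v[i] >>> 32);
        }
        return d;
    }

    // Scalar model of the seven-instruction sequence (assumed semantics).
    // lanes = 1 models vector_len = 0 (128-bit), lanes = 2 models vector_len = 1 (256-bit).
    // a and b must each hold 2 * lanes longs.
    static long[] simulate(long[] a, long[] b, int lanes) {
        int n = 2 * lanes;
        int[] s1 = toDwords(a), s2 = toDwords(b);
        int[] tmp = new int[2 * n];
        // vpshufd imm 177 (0xB1): swap the two dwords of every long in src2
        for (int i = 0; i < 2 * n; i += 2) {
            tmp[i] = s2[i + 1];
            tmp[i + 1] = s2[i];
        }
        // vpmulld: low 32 bits of a.lo*b.hi and a.hi*b.lo
        for (int i = 0; i < 2 * n; i++) {
            tmp[i] *= s1[i];
        }
        // vphaddd tmp,tmp,tmp: pairwise sums, assumed to stay inside each 128-bit lane
        int[] h = new int[2 * n];
        for (int l = 0; l < lanes; l++) {
            int base = 4 * l;
            int p0 = tmp[base] + tmp[base + 1];
            int p1 = tmp[base + 2] + tmp[base + 3];
            h[base] = p0; h[base + 1] = p1; h[base + 2] = p0; h[base + 3] = p1;
        }
        long[] dst = new long[n];
        for (int i = 0; i < n; i++) {
            // vpmovzxdq + vpsllq 32: widen the LOW n dwords of h, move them to the high half
            long cross = (h[i] & 0xFFFFFFFFL) << 32;
            // vpmuludq: unsigned product of the low dwords
            long low = (a[i] & 0xFFFFFFFFL) * (b[i] & 0xFFFFFFFFL);
            // vpaddq
            dst[i] = cross + low;
        }
        return dst;
    }

    public static void main(String[] args) {
        long[] a = {1L << 32, 2L << 32, 3L << 32, 4L << 32};
        long[] b = {3, 3, 3, 3};
        long[] expected = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            expected[i] = a[i] * b[i];
        }
        System.out.println("expected: " + Arrays.toString(expected));
        System.out.println("2 lanes:  " + Arrays.toString(simulate(a, b, 2)));
    }
}
```

Under these assumptions the one-lane model reproduces a[i] * b[i], while the two-lane model reuses the cross terms of elements 0 and 1 for elements 2 and 3. Inputs whose high dwords are all zero still come out right, which would be consistent with a failure that depends on the values being multiplied.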

Best regards,
Vladimir Ivanov

On 13/07/2018 21:09, Vladimir Ivanov wrote:
> Thanks for the report, Adam!
> 
> It looks like the bug is in the vmul4L_reg_avx rule [1].
> 
> For the problematic method (TestArrayMult::main), the auto-vectorizer
> produces the following tracing output:
>    new Vector node:  1831  LoadVector  === ... #vectory[4]:{long} ...
>    new Vector node:  1832  ReplicateL  === ... #vectory[4]:{long} ...
>    new Vector node:  1833  MulVL === ... #vectory[4]:{long} ...
>    new Vector node:  1834  StoreVector ===  ...
> 
> The rule was added by:
>    changeset:   49241:fda63bc58cd1
>    branch:      vectorIntrinsics
>    user:        srukmannagar
>    date:        Thu Mar 01 14:30:02 2018 -0800
>    summary:     vector intrinsics for long and byte mul
> 
> Shravya/Vivek, can you take a look, please?
> 
> Best regards,
> Vladimir Ivanov
> 
> [1] http://hg.openjdk.java.net/panama/dev/file/c33e709a35e5/src/hotspot/cpu/x86/x86.ad#l13974
> 
> 
> instruct vmul4L_reg_avx(vecY dst, vecY src1, vecY src2, vecY tmp, vecY tmp1) %{
>    predicate(UseAVX > 1 && n->as_Vector()->length() == 4 && VM_Version::supports_avx2());
>    match(Set dst (MulVL src1 src2));
>    effect(TEMP tmp1, TEMP tmp);
>    format %{ "vpshufd $tmp,$src2\n\t"
>              "vpmulld $tmp,$src1,$tmp\n\t"
>              "vphaddd $tmp,$tmp,$tmp\n\t"
>              "vpmovzxdq $tmp,$tmp\n\t"
>              "vpsllq $tmp,$tmp\n\t"
>              "vpmuludq $tmp1,$src1,$src2\n\t"
>              "vpaddq $dst,$tmp,$tmp1\t! mul packed4L" %}
>    ins_encode %{
>      int vector_len = 1;
>      __ vpshufd($tmp$$XMMRegister, $src2$$XMMRegister, 177, vector_len);
>      __ vpmulld($tmp$$XMMRegister, $src1$$XMMRegister, $tmp$$XMMRegister, vector_len);
>      __ vphaddd($tmp$$XMMRegister, $tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
>      __ vpmovzxdq($tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
>      __ vpsllq($tmp$$XMMRegister, $tmp$$XMMRegister, 32, vector_len);
>      __ vpmuludq($tmp1$$XMMRegister, $src1$$XMMRegister, $src2$$XMMRegister, vector_len);
>      __ vpaddq($dst$$XMMRegister, $tmp$$XMMRegister, $tmp1$$XMMRegister, vector_len);
>    %}
>    ins_pipe( pipe_slow );
> %}
> 
> 
> 
> On 13/07/2018 18:42, Adam Petcher wrote:
>> I've been working on an experiment in which I try to improve the 
>> performance of X25519[1] using the Vector API and the 
>> vectorIntrinsics branch. I started by building this branch and 
>> running some tests with the existing code to make sure everything 
>> works and get some baseline measurements. I didn't get very far 
>> because some tests fail and the code produces incorrect results on my machines.
>>
>> I get the same problematic results on both a Haswell MacBook Pro and
>> an Ubuntu VM running on a Skylake Windows laptop. If I modify the VM 
>> to remove AVX2 (but leave AVX and everything else), then I don't 
>> encounter the problem. I also tried testing on an older Linux machine 
>> that doesn't have AVX2, and I don't see the problem.
>>
>> The first problem is that an existing test fails: 
>> test/jdk/sun/security/util/math/TestIntegerModuloP.java. This test 
>> exercises the finite field operations used in X25519, which all boil 
>> down to simple arithmetic and bitwise operations on long arrays. The 
>> test fails because it produces a result that is incorrect (it is 
>> different from the value produced by performing the same operation on 
>> BigInteger). To isolate the problem, I developed the simplified test 
>> that is attached. This test fails intermittently with default 
>> settings, and it seems to fail every time with -Xcomp. I've also 
>> attached the assembly that is produced on my MacBook. You can see
>> some AVX instructions near line 3259; this appears to be an
>> auto-vectorization of the loop that multiplies the array by a value.
>> Interestingly, when I tried to simplify the test further, the
>> problem went away, so all of that code seems to be necessary to
>> trigger it.
>>
>> Is anyone else seeing a similar problem with AVX2? Is this a bug in 
>> the intrinsics, or am I doing something wrong here?
>>
>> [1] http://openjdk.java.net/jeps/324
>>
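
The kind of check described in the report (a loop that multiplies a long array by one value, compared against BigInteger) might look like the sketch below. It is not the attached test: the names ArrayMultCheck, multiply, reference, and countMismatches are made up for illustration, and, given the observation that a simplified test stopped failing, a loop this small may well not trigger the problem.

```java
import java.math.BigInteger;
import java.util.SplittableRandom;

class ArrayMultCheck {
    // The loop shape from the report: multiply every element by one value.
    // C2 may auto-vectorize this into LoadVector / ReplicateL / MulVL / StoreVector.
    static void multiply(long[] a, long v) {
        for (int i = 0; i < a.length; i++) {
            a[i] *= v;
        }
    }

    // Reference product via BigInteger; longValue() keeps the low 64 bits,
    // which is what a Java long multiply is defined to return.
    static long reference(long x, long v) {
        return BigInteger.valueOf(x).multiply(BigInteger.valueOf(v)).longValue();
    }

    // Runs the loop on random inputs and counts elements that differ from the reference.
    static int countMismatches(int iterations) {
        SplittableRandom rnd = new SplittableRandom(42); // fixed seed for repeatable runs
        int mismatches = 0;
        for (int it = 0; it < iterations; it++) {
            long[] a = new long[64];
            for (int i = 0; i < a.length; i++) {
                a[i] = rnd.nextLong();
            }
            long v = rnd.nextLong();
            long[] before = a.clone();
            multiply(a, v);
            for (int i = 0; i < a.length; i++) {
                if (a[i] != reference(before[i], v)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(20_000));
    }
}
```

On a correct JVM the count is zero. Running it with -Xcomp, the flag under which the original test failed every time, is the natural first experiment.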