Problem with auto-vectorized code in AVX2

Fri Jul 13 18:09:20 UTC 2018

Thanks for the report, Adam!

It looks like the bug is in the vmul4L_reg_avx rule [1].

For the problematic method (TestArrayMult::main), the auto-vectorizer 
produces the following tracing output:
   new Vector node:  1831  LoadVector  === ... #vectory[4]:{long} ...
   new Vector node:  1832  ReplicateL  === ... #vectory[4]:{long} ...
   new Vector node:  1833  MulVL === ... #vectory[4]:{long} ...
   new Vector node:  1834  StoreVector ===  ...
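
For reference, a load, a broadcast of a scalar (ReplicateL), a vector 
multiply and a store are what one would expect from a loop that 
multiplies a long array by a single value. The sketch below is my own 
guess at that loop shape, not the attached test: the class and method 
names (ScalarMulLoop, mulByScalar, matchesReference) are hypothetical, 
and whether C2 actually vectorizes it depends on the build and flags. 
It checks each product against BigInteger truncated to 64 bits, the 
same kind of comparison the failing test makes.

```java
import java.math.BigInteger;

public class ScalarMulLoop {
    // Hypothetical loop shape: one array load, one scalar broadcast,
    // one 64-bit multiply and one store per element.
    static void mulByScalar(long[] in, long[] out, long k) {
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * k;
        }
    }

    // Reference product: exact BigInteger multiply, then keep the low 64 bits.
    static long reference(long x, long k) {
        return BigInteger.valueOf(x).multiply(BigInteger.valueOf(k)).longValue();
    }

    // Returns true if every element of the loop result equals the reference.
    static boolean matchesReference(long[] in, long k) {
        long[] out = new long[in.length];
        mulByScalar(in, out, k);
        for (int i = 0; i < in.length; i++) {
            if (out[i] != reference(in[i], k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] in = new long[1024];
        for (int i = 0; i < in.length; i++) {
            // Inputs with non-zero high and low 32-bit halves.
            in[i] = ((long) i << 32) + 0x9E3779B9L + i;
        }
        System.out.println("match: " + matchesReference(in, 0x100000003L));
    }
}
```

Running something like this with -Xcomp on an AVX2 machine is one way 
to look for a mismatch, assuming the loop gets vectorized at all.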

The rule was added by:
   changeset:   49241:fda63bc58cd1
   branch:      vectorIntrinsics
   user:        srukmannagar
   date:        Thu Mar 01 14:30:02 2018 -0800
   summary:     vector intrinsics for long and byte mul

Shravya/Vivek, can you take a look, please?

Best regards,
Vladimir Ivanov

[1] http://hg.openjdk.java.net/panama/dev/file/c33e709a35e5/src/hotspot/cpu/x86/x86.ad#l13974

instruct vmul4L_reg_avx(vecY dst, vecY src1, vecY src2, vecY tmp, vecY tmp1) %{
   predicate(UseAVX > 1 && n->as_Vector()->length() == 4 && VM_Version::supports_avx2());
   match(Set dst (MulVL src1 src2));
   effect(TEMP tmp1, TEMP tmp);
   format %{ "vpshufd $tmp,$src2\n\t"
             "vpmulld $tmp,$src1,$tmp\n\t"
             "vphaddd $tmp,$tmp,$tmp\n\t"
             "vpmovzxdq $tmp,$tmp\n\t"
             "vpsllq $tmp,$tmp\n\t"
             "vpmuludq $tmp1,$src1,$src2\n\t"
             "vpaddq $dst,$tmp,$tmp1\t! mul packed4L" %}
   ins_encode %{
     int vector_len = 1;
     __ vpshufd($tmp$$XMMRegister, $src2$$XMMRegister, 177, vector_len);
     __ vpmulld($tmp$$XMMRegister, $src1$$XMMRegister, $tmp$$XMMRegister, vector_len);
     __ vphaddd($tmp$$XMMRegister, $tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
     __ vpmovzxdq($tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
     __ vpsllq($tmp$$XMMRegister, $tmp$$XMMRegister, 32, vector_len);
     __ vpmuludq($tmp1$$XMMRegister, $src1$$XMMRegister, $src2$$XMMRegister, vector_len);
     __ vpaddq($dst$$XMMRegister, $tmp$$XMMRegister, $tmp1$$XMMRegister, vector_len);
   %}
   ins_pipe( pipe_slow );
%}
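
To make it easier to review the rule, here is the per-element 
arithmetic the instruction sequence appears to be aiming for, written 
as scalar Java: the low 64 bits of a * b are lo(a)*lo(b) plus the sum 
of the two cross products shifted left by 32. This is only my reading 
of the intent, and the names (MulLongDecomposition, mulViaHalves) are 
made up for illustration. The comments map each step to the 
instruction I believe it corresponds to.

```java
public class MulLongDecomposition {
    // Low 64 bits of a * b, built from 32-bit halves of the operands.
    static long mulViaHalves(long a, long b) {
        long aLo = a & 0xFFFFFFFFL;
        long aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL;
        long bHi = b >>> 32;

        // vpshufd + vpmulld + vphaddd: sum of the two cross products.
        // Only the low 32 bits survive, since the sum is shifted by 32 below.
        long cross = (aLo * bHi + aHi * bLo) & 0xFFFFFFFFL; // vpmovzxdq zero-extends

        long high = cross << 32;   // vpsllq by 32
        long low = aLo * bLo;      // vpmuludq: unsigned 32x32 -> 64 multiply
        return high + low;         // vpaddq
    }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEFL;
        long b = 0x0FEDCBA987654321L;
        System.out.println("match: " + (mulViaHalves(a, b) == a * b));
    }
}
```

The sketch only pins down what each 64-bit element should end up 
with. What still needs checking in the 256-bit rule is whether every 
cross term lands in the element it belongs to after the horizontal add 
and the zero-extension.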

On 13/07/2018 18:42, Adam Petcher wrote:
> I've been working on an experiment in which I try to improve the 
> performance of X25519[1] using the Vector API and the vectorIntrinsics 
> branch. I started by building this branch and running some tests with 
> the existing code to make sure everything works and get some baseline 
> measurements. I didn't get very far because some tests fail and the code 
> produces incorrect results on my machines.
> 
> I get the same problematic results on both a Haswell Macbook Pro and an 
> Ubuntu VM running on a Skylake Windows laptop. If I modify the VM to 
> remove AVX2 (but leave AVX and everything else), then I don't encounter 
> the problem. I also tried testing on an older Linux machine that doesn't 
> have AVX2, and I don't see the problem.
> 
> The first problem is that an existing test fails: 
> test/jdk/sun/security/util/math/TestIntegerModuloP.java. This test 
> exercises the finite field operations used in X25519, which all boil 
> down to simple arithmetic and bitwise operations on long arrays. The 
> test fails because it produces a result that is incorrect (it is 
> different from the value produced by performing the same operation on 
> BigInteger). To isolate the problem, I developed the simplified test 
> that is attached. This test fails intermittently with default settings, 
> and it seems to fail every time with -Xcomp. I've also attached the 
> assembly that is produced on my Macbook. You can see some AVX 
> instructions near line 3259---this appears to be an auto-vectorization 
> of the loop that multiplies the array by a value. An interesting thing 
> about this test is that I tried to simplify it further, but then the 
> problem goes away. So all of that code seems to be necessary to make the 
> problem happen.
> 
> Is anyone else seeing a similar problem with AVX2? Is this a bug in the 
> intrinsics, or am I doing something wrong here?
> 
> [1] http://openjdk.java.net/jeps/324
>