Problem with auto-vectorized code in AVX2
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Jul 13 18:09:20 UTC 2018
Thanks for the report, Adam!
It looks like the bug is in the vmul4L_reg_avx rule [1].
For the problematic method (TestArrayMult::main), the auto-vectorizer
produces the following tracing output:
new Vector node: 1831 LoadVector === ... #vectory[4]:{long} ...
new Vector node: 1832 ReplicateL === ... #vectory[4]:{long} ...
new Vector node: 1833 MulVL === ... #vectory[4]:{long} ...
new Vector node: 1834 StoreVector === ...
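For reference, a loop of roughly the following shape is the kind of code that would be expected to yield a LoadVector / ReplicateL / MulVL / StoreVector sequence. This is only a minimal sketch: the class and method names are hypothetical, it is not the attached TestArrayMult source, and whether C2 actually vectorizes it depends on the build and flags. The BigInteger comparison mirrors the way the failing test checks results.

```java
import java.math.BigInteger;

// Hypothetical sketch; NOT the attached TestArrayMult source.
final class ArrayMultSketch {

    // Multiplies every element by the same scalar. A loop of this shape is a
    // candidate for auto-vectorization: load a[i..i+3], broadcast x,
    // multiply as packed longs, store back.
    static void multiplyInPlace(long[] a, long x) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * x;
        }
    }

    // Compares the loop result against BigInteger arithmetic truncated to
    // 64 bits (BigInteger.longValue() keeps the low-order 64 bits).
    static boolean matchesReference(long[] input, long x) {
        long[] actual = input.clone();
        multiplyInPlace(actual, x);
        BigInteger bx = BigInteger.valueOf(x);
        for (int i = 0; i < input.length; i++) {
            long expected = BigInteger.valueOf(input[i]).multiply(bx).longValue();
            if (actual[i] != expected) {
                return false;
            }
        }
        return true;
    }
}
```

Running such a check repeatedly, or with -Xcomp, is one way to observe a miscompiled multiply, assuming the loop does get vectorized.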
The rule was added by:
changeset: 49241:fda63bc58cd1
branch: vectorIntrinsics
user: srukmannagar
date: Thu Mar 01 14:30:02 2018 -0800
summary: vector intrinsics for long and byte mul
Shravya/Vivek, can you take a look, please?
Best regards,
Vladimir Ivanov
[1]
http://hg.openjdk.java.net/panama/dev/file/c33e709a35e5/src/hotspot/cpu/x86/x86.ad#l13974
instruct vmul4L_reg_avx(vecY dst, vecY src1, vecY src2, vecY tmp, vecY tmp1) %{
  predicate(UseAVX > 1 && n->as_Vector()->length() == 4 && VM_Version::supports_avx2());
  match(Set dst (MulVL src1 src2));
  effect(TEMP tmp1, TEMP tmp);
  format %{ "vpshufd $tmp,$src2\n\t"
            "vpmulld $tmp,$src1,$tmp\n\t"
            "vphaddd $tmp,$tmp,$tmp\n\t"
            "vpmovzxdq $tmp,$tmp\n\t"
            "vpsllq $tmp,$tmp\n\t"
            "vpmuludq $tmp1,$src1,$src2\n\t"
            "vpaddq $dst,$tmp,$tmp1\t! mul packed4L" %}
  ins_encode %{
    int vector_len = 1;
    __ vpshufd($tmp$$XMMRegister, $src2$$XMMRegister, 177, vector_len);
    __ vpmulld($tmp$$XMMRegister, $src1$$XMMRegister, $tmp$$XMMRegister, vector_len);
    __ vphaddd($tmp$$XMMRegister, $tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
    __ vpmovzxdq($tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
    __ vpsllq($tmp$$XMMRegister, $tmp$$XMMRegister, 32, vector_len);
    __ vpmuludq($tmp1$$XMMRegister, $src1$$XMMRegister, $src2$$XMMRegister, vector_len);
    __ vpaddq($dst$$XMMRegister, $tmp$$XMMRegister, $tmp1$$XMMRegister, vector_len);
  %}
  ins_pipe( pipe_slow );
%}
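The instruction sequence in the rule appears to be aiming at the usual decomposition of a 64-bit multiply into 32-bit pieces: the two cross products are summed and shifted left by 32, then added to the full unsigned product of the low halves. The scalar Java sketch below shows that identity for a single element. The class and method names are hypothetical, and the mapping to individual instructions in the comments is my reading of the rule, not something stated in the changeset.

```java
// Hypothetical scalar model of the per-element arithmetic; not HotSpot code.
final class LongMulDecomposition {

    // Returns the low 64 bits of a * b using only 32-bit halves.
    static long mulLow64(long a, long b) {
        long aLo = a & 0xFFFFFFFFL;
        long aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL;
        long bHi = b >>> 32;

        // Cross terms: only their low 32 bits survive the shift by 32.
        // (Presumably what vpshufd + vpmulld + vphaddd compute per element.)
        long cross = (aHi * bLo + aLo * bHi) & 0xFFFFFFFFL;

        // Full 64-bit product of the low halves, as vpmuludq produces.
        long low = aLo * bLo;

        // Shift the cross sum into the high half and add (vpsllq, vpaddq).
        return (cross << 32) + low;
    }
}
```

For any pair of longs, mulLow64(a, b) should equal the plain Java expression a * b, so a vectorized MulVL has to produce this value in every one of the four elements.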
On 13/07/2018 18:42, Adam Petcher wrote:
> I've been working on an experiment in which I try to improve the
> performance of X25519[1] using the Vector API and the vectorIntrinsics
> branch. I started by building this branch and running some tests with
> the existing code to make sure everything works and get some baseline
> measurements. I didn't get very far because some tests fail and the code
> produces incorrect results on my machines.
>
> I get the same problematic results on both a Haswell MacBook Pro and an
> Ubuntu VM running on a Skylake Windows laptop. If I modify the VM to
> remove AVX2 (but leave AVX and everything else), then I don't encounter
> the problem. I also tried testing on an older Linux machine that doesn't
> have AVX2, and I don't see the problem.
>
> The first problem is that an existing test fails:
> test/jdk/sun/security/util/math/TestIntegerModuloP.java. This test
> exercises the finite field operations used in X25519, which all boil
> down to simple arithmetic and bitwise operations on long arrays. The
> test fails because it produces a result that is incorrect (it is
> different from the value produced by performing the same operation on
> BigInteger). To isolate the problem, I developed the simplified test
> that is attached. This test fails intermittently with default settings,
> and it seems to fail every time with -Xcomp. I've also attached the
> assembly that is produced on my MacBook. You can see some AVX
> instructions near line 3259; this appears to be an auto-vectorization
> of the loop that multiplies the array by a value. An interesting thing
> about this test is that I tried to simplify it further, but then the
> problem goes away. So all of that code seems to be necessary to make the
> problem happen.
>
> Is anyone else seeing a similar problem with AVX2? Is this a bug in the
> intrinsics, or am I doing something wrong here?
>
> [1] http://openjdk.java.net/jeps/324
>