Problem with auto-vectorized code in AVX2
Rukmannagari, Shravya
shravya.rukmannagari at intel.com
Mon Jul 16 20:53:24 UTC 2018
Hi Vladimir,
I'll take a look at this.
Thanks,
Shravya.
-----Original Message-----
From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
Sent: Friday, July 13, 2018 11:37 AM
To: panama-dev at openjdk.java.net; Rukmannagari, Shravya <shravya.rukmannagari at intel.com>; Deshpande, Vivek R <vivek.r.deshpande at intel.com>
Cc: Adam Petcher <adam.petcher at oracle.com>
Subject: Re: Problem with auto-vectorized code in AVX2
I haven't dug into the details, but with -XX:UseAVX=1 a 2-element long vector and vmul2L_reg_avx are used, and the test works fine.
The only difference between vmul2L_reg_avx and vmul4L_reg_avx is vector_len.
Best regards,
Vladimir Ivanov
On 13/07/2018 21:09, Vladimir Ivanov wrote:
> Thanks for the report, Adam!
>
> It looks like the bug is in the vmul4L_reg_avx rule [1].
>
> For the problematic method (TestArrayMult::main), the auto-vectorizer
> produces the following tracing output:
> new Vector node: 1831 LoadVector === ... #vectory[4]:{long} ...
> new Vector node: 1832 ReplicateL === ... #vectory[4]:{long} ...
> new Vector node: 1833 MulVL === ... #vectory[4]:{long} ...
> new Vector node: 1834 StoreVector === ...
>
> The rule was added by:
> changeset: 49241:fda63bc58cd1
> branch: vectorIntrinsics
> user: srukmannagar
> date: Thu Mar 01 14:30:02 2018 -0800
> summary: vector intrinsics for long and byte mul
>
> Shravya/Vivek, can you take a look, please?
>
> Best regards,
> Vladimir Ivanov
>
> [1]
> http://hg.openjdk.java.net/panama/dev/file/c33e709a35e5/src/hotspot/cpu/x86/x86.ad#l13974
>
>
> instruct vmul4L_reg_avx(vecY dst, vecY src1, vecY src2, vecY tmp, vecY tmp1) %{
>   predicate(UseAVX > 1 && n->as_Vector()->length() == 4 && VM_Version::supports_avx2());
>   match(Set dst (MulVL src1 src2));
>   effect(TEMP tmp1, TEMP tmp);
>   format %{ "vpshufd $tmp,$src2\n\t"
>             "vpmulld $tmp,$src1,$tmp\n\t"
>             "vphaddd $tmp,$tmp,$tmp\n\t"
>             "vpmovzxdq $tmp,$tmp\n\t"
>             "vpsllq $tmp,$tmp\n\t"
>             "vpmuludq $tmp1,$src1,$src2\n\t"
>             "vpaddq $dst,$tmp,$tmp1\t! mul packed4L" %}
>   ins_encode %{
>     int vector_len = 1;
>     __ vpshufd($tmp$$XMMRegister, $src2$$XMMRegister, 177, vector_len);
>     __ vpmulld($tmp$$XMMRegister, $src1$$XMMRegister, $tmp$$XMMRegister, vector_len);
>     __ vphaddd($tmp$$XMMRegister, $tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
>     __ vpmovzxdq($tmp$$XMMRegister, $tmp$$XMMRegister, vector_len);
>     __ vpsllq($tmp$$XMMRegister, $tmp$$XMMRegister, 32, vector_len);
>     __ vpmuludq($tmp1$$XMMRegister, $src1$$XMMRegister, $src2$$XMMRegister, vector_len);
>     __ vpaddq($dst$$XMMRegister, $tmp$$XMMRegister, $tmp1$$XMMRegister, vector_len);
>   %}
>   ins_pipe( pipe_slow );
> %}
>
>
>
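The instruction sequence in the rule quoted above appears to assemble the low 64 bits of a long product from 32-bit pieces. The following is only a scalar sketch of that arithmetic for a single element. The class and method names are hypothetical, the mapping of steps to instructions in the comments is an interpretation, and the sketch does not model how the 256-bit instructions treat individual lanes.

```java
class MulLongSketch {
    // Low 64 bits of a * b, computed from 32-bit halves only.
    // With a = aHi*2^32 + aLo and b = bHi*2^32 + bLo:
    //   a*b mod 2^64 = aLo*bLo + (((aLo*bHi + aHi*bLo) mod 2^32) << 32)
    static long mulLow64(long a, long b) {
        long aLo = a & 0xFFFFFFFFL;
        long aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL;
        long bHi = b >>> 32;

        // Cross terms; only their low 32 bits survive the later shift.
        // (Roughly the role of vpshufd + vpmulld + vphaddd.)
        int cross = (int) (aLo * bHi) + (int) (aHi * bLo);

        // Zero-extend to 64 bits and move into the upper half.
        // (Roughly the role of vpmovzxdq + vpsllq by 32.)
        long high = (cross & 0xFFFFFFFFL) << 32;

        // Unsigned 32x32 -> 64-bit product of the low halves.
        // (Roughly the role of vpmuludq.)
        long low = aLo * bLo;

        // Final 64-bit add (vpaddq).
        return high + low;
    }

    public static void main(String[] args) {
        long[] samples = { 0L, 1L, -1L, 3L, 0x123456789ABCDEFL, Long.MIN_VALUE, Long.MAX_VALUE };
        for (long a : samples) {
            for (long b : samples) {
                if (mulLow64(a, b) != a * b) {
                    throw new AssertionError("mismatch for " + a + " * " + b);
                }
            }
        }
        System.out.println("all sample products match a * b");
    }
}
```

Comparing against the plain `a * b` works because Java long multiplication already keeps only the low 64 bits of the product.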
> On 13/07/2018 18:42, Adam Petcher wrote:
>> I've been working on an experiment in which I try to improve the
>> performance of X25519[1] using the Vector API and the
>> vectorIntrinsics branch. I started by building this branch and
>> running some tests with the existing code to make sure everything
>> works and get some baseline measurements. I didn't get very far
>> because some tests fail and the code produces incorrect results on my machines.
>>
>> I get the same problematic results on both a Haswell Macbook Pro and
>> an Ubuntu VM running on a Skylake Windows laptop. If I modify the VM
>> to remove AVX2 (but leave AVX and everything else), then I don't
>> encounter the problem. I also tried testing on an older Linux machine
>> that doesn't have AVX2, and I don't see the problem.
>>
>> The first problem is that an existing test fails:
>> test/jdk/sun/security/util/math/TestIntegerModuloP.java. This test
>> exercises the finite field operations used in X25519, which all boil
>> down to simple arithmetic and bitwise operations on long arrays. The
>> test fails because it produces a result that is incorrect (it is
>> different from the value produced by performing the same operation on
>> BigInteger). To isolate the problem, I developed the simplified test
>> that is attached. This test fails intermittently with default
>> settings, and it seems to fail every time with -Xcomp. I've also
>> attached the assembly that is produced on my Macbook. You can see
>> some AVX instructions near line 3259---this appears to be an
>> auto-vectorization of the loop that multiplies the array by a value.
>> An interesting thing about this test is that I tried to simplify it
>> further, but then the problem goes away. So all of that code seems to
>> be necessary to make the problem happen.
>>
>> Is anyone else seeing a similar problem with AVX2? Is this a bug in
>> the intrinsics, or am I doing something wrong here?
>>
>> [1] http://openjdk.java.net/jeps/324
>>
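The kind of check described in the original report (multiply a long array by a value in a simple loop and compare against BigInteger) can be sketched as follows. This is not the attached test: the class name, array size, and iteration counts are assumptions chosen for illustration. Since the report notes that the problem disappeared when the test was simplified further, a loop this small may well not trigger the failure.

```java
import java.math.BigInteger;
import java.util.Random;

class ArrayMultCheck {
    // Simple counted loop over long[]; a candidate for auto-vectorization.
    static void multiply(long[] a, long v, long[] out) {
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] * v;
        }
    }

    // Reference result: BigInteger product truncated to its low 64 bits.
    static long reference(long a, long v) {
        return BigInteger.valueOf(a).multiply(BigInteger.valueOf(v)).longValue();
    }

    // Runs the loop repeatedly on random data and counts wrong elements.
    static int countMismatches(int iterations, long seed) {
        Random rnd = new Random(seed);
        long[] a = new long[64];
        long[] out = new long[64];
        int mismatches = 0;
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < a.length; i++) {
                a[i] = rnd.nextLong();
            }
            long v = rnd.nextLong();
            multiply(a, v, out);
            for (int i = 0; i < a.length; i++) {
                if (out[i] != reference(a[i], v)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Enough iterations to give the JIT a chance to compile multiply().
        System.out.println("mismatches: " + countMismatches(20000, 42L));
    }
}
```

Running it once with default settings, once with -Xcomp, and once with -XX:UseAVX=1 mirrors the configurations compared in this thread.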