RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors
Deshpande, Vivek R
vivek.r.deshpande at intel.com
Tue Oct 8 17:50:10 UTC 2019
Hi Vladimir,
vecX has access to the upper bank of registers (xmm16-31) when avx512vl, avx512bw and avx512dq are supported.
vecX works for Repl4F_zero because xorps emits vpxor when avx512vl, bw and dq are not available.
This is the snippet from MacroAssembler that is used here:
void MacroAssembler::xorps(XMMRegister dst, XMMRegister src) {
  if (UseAVX > 2 && !VM_Version::supports_avx512dq() && (dst->encoding() == src->encoding())) {
    Assembler::vpxor(dst, dst, src, Assembler::AVX_512bit);
  } else {
    Assembler::xorps(dst, src);
  }
}
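
To make the dispatch above easier to follow outside of HotSpot, here is a minimal standalone sketch. It only models the condition quoted in the snippet (which checks supports_avx512dq()); the struct XorpsCpu and the function select_xorps_encoding are hypothetical names for illustration and are not part of the VM.

```cpp
#include <string>

// Hypothetical stand-in for the VM state consulted by the quoted snippet.
struct XorpsCpu {
  int  use_avx;       // models the UseAVX flag
  bool has_avx512dq;  // models VM_Version::supports_avx512dq()
};

// Mirrors the condition in the quoted MacroAssembler::xorps and returns the
// name of the instruction that would be emitted for the given encodings.
std::string select_xorps_encoding(const XorpsCpu& cpu, int dst_enc, int src_enc) {
  if (cpu.use_avx > 2 && !cpu.has_avx512dq && dst_enc == src_enc) {
    return "vpxor (512-bit)";  // zeroing idiom routed to the 512-bit vpxor
  }
  return "xorps";
}
```

In Repl4F_zero both operands are $dst, so the encodings always match and, in this model, the vpxor branch is taken whenever UseAVX > 2 and avx512dq is missing.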
On the second point, about MoveVecX2Leg, you may be right; it could be an opportunity to improve.
Regards,
Vivek
-----Original Message-----
From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
Sent: Monday, October 7, 2019 10:18 AM
To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
Cc: 'Robert Strout' <robert.strout at oracle.com>; 'Scott Oaks' <scott.oaks at oracle.com>
Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors
Vivek,
Thinking more about it, I have a question:
src/hotspot/cpu/x86/x86.ad:
instruct Repl4F_zero(vecX dst, immF0 zero) %{
-  predicate(n->as_Vector()->length() == 4 && UseAVX < 3);
+  predicate(n->as_Vector()->length() == 4);
   match(Set dst (ReplicateF zero));
   format %{ "xorps $dst,$dst\t! replicate4F zero" %}
   ins_encode %{
     __ xorps($dst$$XMMRegister, $dst$$XMMRegister);
   %}
   ins_pipe( fpu_reg_reg );
%}
-instruct Repl4F_zero_evex(vecX dst, immF0 zero) %{
-  predicate(n->as_Vector()->length() == 4 && UseAVX > 2);
-  match(Set dst (ReplicateF zero));
-  format %{ "vpxor $dst k0,$dst,$dst\t! replicate4F zero" %}
-  ins_encode %{
-    // Use vpxor in place of vxorps since EVEX has a constriant on dq for vxorps: this is a 512-bit operation
-    int vector_len = 2;
-    __ vpxor($dst$$XMMRegister,$dst$$XMMRegister, $dst$$XMMRegister, vector_len);
-  %}
-  ins_pipe( fpu_reg_reg );
-%}
Are there any issues with vecX when encoding xorps with dst in the upper half
(xmm16-31) without avx512vl support? Should it be changed to legVecX?
Also:
// Load vectors (16 bytes long)
instruct MoveVecX2Leg(legVecX dst, vecX src) %{
  match(Set dst src);
  format %{ "movdqu $dst,$src\t! load vector (16 bytes)" %}
  ins_encode %{
    if (UseAVX > 2 && !VM_Version::supports_avx512vl()) {
      int vector_len = 2;
      __ evmovdquq($dst$$XMMRegister, $src$$XMMRegister, vector_len);
    } else {
      __ movdqu($dst$$XMMRegister, $src$$XMMRegister);
    }
  %}
  ins_pipe( fpu_reg_reg );
%}
From a performance perspective, does it make sense to further limit the
EVEX-encoded case when src is in the lower half of the range (xmm0-15)?
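
One possible reading of this question, as a minimal standalone sketch rather than a proposed patch: keep the 512-bit EVEX move only when src actually lives in the upper bank. The names MoveCpu, current_move and refined_move are hypothetical, and the sketch assumes that legVecX keeps dst in xmm0-15, so only the src encoding matters.

```cpp
#include <string>

// Encodings 0-15 are the lower half of the range; 16-31 the upper half.
constexpr int kFirstUpperBankEncoding = 16;

// Hypothetical stand-in for the VM state consulted by MoveVecX2Leg.
struct MoveCpu {
  int  use_avx;       // models the UseAVX flag
  bool has_avx512vl;  // models VM_Version::supports_avx512vl()
};

// Behaviour of the quoted MoveVecX2Leg: the choice ignores where src lives.
std::string current_move(const MoveCpu& cpu) {
  if (cpu.use_avx > 2 && !cpu.has_avx512vl) {
    return "evmovdquq (512-bit)";
  }
  return "movdqu";
}

// Assumed refinement: take the EVEX path only when src is in xmm16-31.
std::string refined_move(const MoveCpu& cpu, int src_enc) {
  const bool src_in_upper_bank = src_enc >= kFirstUpperBankEncoding;
  if (cpu.use_avx > 2 && !cpu.has_avx512vl && src_in_upper_bank) {
    return "evmovdquq (512-bit)";
  }
  return "movdqu";
}
```

The two functions differ only for src in xmm0-15 on a CPU without avx512vl, which is exactly the case the question is about; whether that difference is measurable is not something this sketch can answer.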
Best regards,
Vladimir Ivanov
On 07/10/2019 14:03, Vladimir Ivanov wrote:
>
>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.01/
>
> Looks good.
>
> Best regards,
> Vladimir Ivanov
>
>> -----Original Message-----
>> From: Deshpande, Vivek R
>> Sent: Thursday, September 26, 2019 10:52 AM
>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole
>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>> Subject: RE: RFR: 8221092: UseAVX=3 has performance degredation on
>> Skylake (X7) processors
>>
>> Thank you, Vladimir, for the review.
>> I will work on adding the comments and listing the changes in the bug report.
>>
>> Yes, the threshold is for architectures after Skylake that support
>> AVX512.
>> With this threshold (value = 4096 bytes, found experimentally), AVX512
>> will be used if the array size is bigger than the threshold.
>>
>> Regards,
>> Vivek
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Wednesday, September 25, 2019 12:41 PM
>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole
>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on
>> Skylake (X7) processors
>>
>> Thank you, Vivek
>>
>> I see you made several changes, including to intrinsics code. It would be
>> nice if you listed the changes in the bug report. I see you removed the
>> _evex instruction variants in the .ad file, replaced EVEX instructions in
>> stubs, and set UseAVX to 2 for Skylake. That is easy to understand.
>>
>> But what about the array limit AVX3Threshold? I assume it is for
>> non-Skylake CPUs with AVX512. Right?
>> What is the number 4096 based on? It seems AVX3Threshold == 0 has a special
>> meaning; add a line in globals_x86.hpp explaining it. I would need more
>> time to look at the related changes.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/3/19 5:02 PM, Deshpande, Vivek R wrote:
>>> Hi All
>>>
>>> I have created a patch which sets AVX2 for Skylake and selectively
>>> uses EVEX instructions, based on a threshold, for platforms after Skylake.
>>> I don't observe the SPECjvm2008 regressions on Skylake with this
>>> patch.
>>> JBS id:
>>> https://bugs.openjdk.java.net/browse/JDK-8221092
>>> Webrev:
>>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.00/
>>> Could you all please review the patch?
>>>
>>> Regards,
>>> Vivek
>>>
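
For reference, the threshold rule described in the quoted messages ("AVX512 will be used if the array size is bigger than" 4096 bytes) can be sketched as below. This is a simplified standalone model, not the stub code from the webrev: the function name use_avx512_path is hypothetical, and the special meaning of AVX3Threshold == 0 mentioned in the review is deliberately not modeled because the thread does not spell it out.

```cpp
#include <cstddef>

// 4096 bytes is the experimentally chosen value mentioned in the thread.
constexpr std::size_t kAvx3ThresholdBytes = 4096;

// Returns true when, under this simplified model, a stub would take the
// AVX-512 path: UseAVX must be 3 and the array must exceed the threshold.
bool use_avx512_path(int use_avx, std::size_t array_size_bytes,
                     std::size_t threshold = kAvx3ThresholdBytes) {
  return use_avx > 2 && array_size_bytes > threshold;
}
```

With the default threshold, an 8192-byte array would take the AVX-512 path under UseAVX=3, while a 4096-byte array would not.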