RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors

Tue Oct 8 17:50:10 UTC 2019

Hi Vladimir

VecX has the higher bank of registers (xmm16-31) when avx512vl, avx512bw and avx512dq are supported.
VecX works for Repl4F_zero because xorps calls vpxor when avx512vl, bw and dq are not available.

This is the snippet from MacroAssembler that is used here:
void MacroAssembler::xorps(XMMRegister dst, XMMRegister src) {
  if (UseAVX > 2 && !VM_Version::supports_avx512dq() && (dst->encoding() == src->encoding())) {
    Assembler::vpxor(dst, dst, src, Assembler::AVX_512bit);
  } else {
    Assembler::xorps(dst, src);
  }
}
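
To make the selection rule in that snippet easier to check by hand, here is a minimal standalone sketch of the same condition. It is not HotSpot code: CpuFeatures and select_xor are hypothetical names introduced only for illustration, and the returned strings are just labels for the two branches.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the flag and CPU feature query used by the
// snippet above; this is NOT the HotSpot VM_Version API.
struct CpuFeatures {
  int  use_avx;   // models the UseAVX flag
  bool avx512dq;  // models VM_Version::supports_avx512dq()
};

// Returns a label for the branch MacroAssembler::xorps would take.
// dst_enc and src_enc model dst->encoding() and src->encoding().
std::string select_xor(const CpuFeatures& cpu, int dst_enc, int src_enc) {
  if (cpu.use_avx > 2 && !cpu.avx512dq && dst_enc == src_enc) {
    // Zeroing idiom without avx512dq: the 512-bit vpxor path is taken.
    return "vpxor/512";
  }
  // Every other combination falls through to plain xorps.
  return "xorps";
}
```

With this model, select_xor(CpuFeatures{3, false}, 17, 17) takes the vpxor path, while the same call with avx512dq set to true, or with different dst and src encodings, takes the xorps path.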

On the second point, about MoveVecX2Leg: you may be right, and it could be an opportunity for improvement.

Regards,
Vivek
-----Original Message-----
From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com] 
Sent: Monday, October 7, 2019 10:18 AM
To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
Cc: 'Robert Strout' <robert.strout at oracle.com>; 'Scott Oaks' <scott.oaks at oracle.com>
Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors

Vivek,

Thinking more about it, I got a question:

src/hotspot/cpu/x86/x86.ad:

  instruct Repl4F_zero(vecX dst, immF0 zero) %{
-  predicate(n->as_Vector()->length() == 4 && UseAVX < 3);
+  predicate(n->as_Vector()->length() == 4);
    match(Set dst (ReplicateF zero));
    format %{ "xorps   $dst,$dst\t! replicate4F zero" %}
    ins_encode %{
      __ xorps($dst$$XMMRegister, $dst$$XMMRegister);
    %}
    ins_pipe( fpu_reg_reg );
  %}

-instruct Repl4F_zero_evex(vecX dst, immF0 zero) %{
-  predicate(n->as_Vector()->length() == 4 && UseAVX > 2);
-  match(Set dst (ReplicateF zero));
-  format %{ "vpxor  $dst k0,$dst,$dst\t! replicate4F zero" %}
-  ins_encode %{
-    // Use vpxor in place of vxorps since EVEX has a constriant on dq for vxorps: this is a 512-bit operation
-    int vector_len = 2;
-    __ vpxor($dst$$XMMRegister,$dst$$XMMRegister, $dst$$XMMRegister, vector_len);
-  %}
-  ins_pipe( fpu_reg_reg );
-%}

Are there any issues with vecX when encoding xorps with dst in the higher 
half (xmm16-31) without avx512vl support? Should it be changed to legVecX?

Also:

// Load vectors (16 bytes long)
instruct MoveVecX2Leg(legVecX dst, vecX src) %{
   match(Set dst src);
   format %{ "movdqu $dst,$src\t! load vector (16 bytes)" %}
   ins_encode %{
     if (UseAVX > 2 && !VM_Version::supports_avx512vl()) {
       int vector_len = 2;
       __ evmovdquq($dst$$XMMRegister, $src$$XMMRegister, vector_len);
     } else {
       __ movdqu($dst$$XMMRegister, $src$$XMMRegister);
     }
   %}
   ins_pipe( fpu_reg_reg );
%}

From a performance perspective, does it make any sense to further limit 
the EVEX-encoded case when src is in the lower half of the range (0-15)?
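
As a rough illustration of that question, the sketch below models the current encoding choice in MoveVecX2Leg next to one possible refinement. It is a standalone model under stated assumptions, not a proposed patch: MoveEncoding, current_rule and refined_rule are hypothetical names, and the refinement assumes the allocator could actually place src in xmm16-31 in this configuration.

```cpp
#include <cassert>

// Labels for the two encodings used by MoveVecX2Leg (illustrative only).
enum class MoveEncoding { Movdqu128, Evmovdquq512 };

// Models the condition in the instruct quoted above.
MoveEncoding current_rule(int use_avx, bool has_avx512vl) {
  return (use_avx > 2 && !has_avx512vl) ? MoveEncoding::Evmovdquq512
                                        : MoveEncoding::Movdqu128;
}

// Hypothetical refinement: take the 512-bit EVEX form only when src sits in
// the upper bank (encoding 16-31). dst is legVecX, so it is always in 0-15.
MoveEncoding refined_rule(int use_avx, bool has_avx512vl, int src_enc) {
  const bool src_in_upper_bank = src_enc >= 16;
  if (use_avx > 2 && !has_avx512vl && src_in_upper_bank) {
    return MoveEncoding::Evmovdquq512;
  }
  return MoveEncoding::Movdqu128;
}
```

The two rules differ only for src in 0-15 with UseAVX > 2 and no avx512vl: the current rule still emits the 512-bit move there, while the refined one falls back to movdqu.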

Best regards,
Vladimir Ivanov

On 07/10/2019 14:03, Vladimir Ivanov wrote:
> 
>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.01/
> 
> Looks good.
> 
> Best regards,
> Vladimir Ivanov
> 
>> -----Original Message-----
>> From: Deshpande, Vivek R
>> Sent: Thursday, September 26, 2019 10:52 AM
>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; 
>> hotspot-compiler-dev at openjdk.java.net
>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole 
>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>> Subject: RE: RFR: 8221092: UseAVX=3 has performance degredation on 
>> Skylake (X7) processors
>>
>> Thank you, Vladimir, for the review.
>> I will work on adding the comments and listing the changes in the bug report.
>>
>> Yes, the threshold is for architectures after Skylake which support 
>> AVX512.
>> With this threshold (value = 4096 bytes, found experimentally), AVX512 
>> will be used only if the array size is bigger than the threshold.
>>
>> Regards,
>> Vivek
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Wednesday, September 25, 2019 12:41 PM
>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 
>> hotspot-compiler-dev at openjdk.java.net
>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole 
>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on 
>> Skylake (X7) processors
>>
>> Thank you, Vivek
>>
>> I see you made several changes, including to the intrinsics code. It would 
>> be nice if you listed the changes you made in the bug report. I see you 
>> removed the _evex instruction variants in the .ad file, replaced EVEX 
>> instructions in stubs and set UseAVX to 2 for Skylake. That is easy to 
>> understand.
>>
>> But what about the array limit AVX3Threshold? I assume it is for 
>> non-Skylake CPUs with AVX512. Right?
>> What is the number 4096 based on? It seems AVX3Threshold == 0 has a special 
>> meaning; please add a line in globals_x86.hpp explaining it. I would need 
>> more time to look at the related changes.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/3/19 5:02 PM, Deshpande, Vivek R wrote:
>>> Hi All
>>>
>>> I have created a patch which sets AVX2 for Skylake and selectively
>>> uses EVEX instructions, based on a threshold, for platforms after Skylake.
>>> With this patch I don't observe the regressions for SPECjvm2008 on 
>>> Skylake.
>>> JBS id:
>>> https://bugs.openjdk.java.net/browse/JDK-8221092
>>> Webrev:
>>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.00/
>>> Could you all please review the patch?
>>>
>>> Regards,
>>> Vivek
>>>
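
The threshold behaviour described earlier in the thread (AVX512 used only when the array is bigger than 4096 bytes) can be sketched as a simple size check. This is only an illustration of the stated rule: kAvx3ThresholdBytes and use_avx512_for_array are hypothetical names, and the special meaning of AVX3Threshold == 0 is not spelled out in the thread, so it is not modeled here.

```cpp
#include <cassert>
#include <cstddef>

// Value quoted in the thread (found experimentally); the name is hypothetical.
constexpr std::size_t kAvx3ThresholdBytes = 4096;

// Returns true when the array is strictly bigger than the threshold, which
// is when the thread says AVX512 would be used on post-Skylake CPUs.
bool use_avx512_for_array(std::size_t array_bytes,
                          std::size_t threshold = kAvx3ThresholdBytes) {
  return array_bytes > threshold;
}
```

Under this rule an array of exactly 4096 bytes stays on the narrower path, and 4097 bytes is the first size that switches to AVX512.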