RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors

Tue Oct 8 18:28:55 UTC 2019

Hi Vladimir,

The declaration of the vecX operand type takes care of this: it includes the upper register bank (xmm16-31) only on architectures that support avx512vl, avx512bw and avx512dq:

reg_class_dynamic vectorx_reg(vectorx_reg_evex, vectorx_reg_legacy, %{ VM_Version::supports_evex() %} );
reg_class_dynamic vectorx_reg_vlbwdq(vectorx_reg_evex, vectorx_reg_legacy, %{ VM_Version::supports_avx512vlbwdq() %} );

operand vecX() %{
  constraint(ALLOC_IN_RC(vectorx_reg_vlbwdq));
  match(VecX);

  format %{ %}
  interface(REG_INTER);
%}
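
To illustrate the selection above, here is a minimal C++ sketch of how the two reg_class_dynamic predicates differ. It is not HotSpot code: CpuFeatures and the two max_reg_* helpers are hypothetical names, and the register ranges (xmm0-31 for the evex class, xmm0-15 for the legacy class) are assumptions for a 64-bit build.

```cpp
// Hypothetical stand-in for the CPU feature queries in VM_Version.
struct CpuFeatures {
    bool evex;      // assumed: what VM_Version::supports_evex() reports
    bool avx512vl;
    bool avx512bw;
    bool avx512dq;
};

// Assumption: supports_avx512vlbwdq() is true only when EVEX and all
// three extensions (VL, BW, DQ) are available.
inline bool supports_avx512vlbwdq(const CpuFeatures& f) {
    return f.evex && f.avx512vl && f.avx512bw && f.avx512dq;
}

// Models vectorx_reg: picks the evex class whenever EVEX is supported.
// Returns the highest XMM register number the class would allow.
inline int max_reg_vectorx_reg(const CpuFeatures& f) {
    return f.evex ? 31 : 15;
}

// Models vectorx_reg_vlbwdq, the class vecX allocates from: picks the
// evex class only when VL, BW and DQ are all present.
inline int max_reg_vectorx_reg_vlbwdq(const CpuFeatures& f) {
    return supports_avx512vlbwdq(f) ? 31 : 15;
}
```

Under these assumptions, a CPU with EVEX but without VL, BW and DQ gets xmm0-31 from vectorx_reg but only xmm0-15 from vectorx_reg_vlbwdq, so a vecX operand never lands in the upper bank there.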

Best Regards,
Sandhya

-----Original Message-----
From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com> 
Sent: Tuesday, October 08, 2019 11:10 AM
To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
Cc: 'Robert Strout' <robert.strout at oracle.com>; 'Scott Oaks' <scott.oaks at oracle.com>; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors

Thanks for the clarifications, Vivek.

> VecX includes the upper bank of registers only when avx512vl, avx512bw and avx512dq are all supported.
> VecX works for Repl4F_zero because xorps calls vpxor when avx512vl, bw and dq are not available.
> 
> This is the snippet from macroAssembler which is used here:
> void MacroAssembler::xorps(XMMRegister dst, XMMRegister src) {
>    if (UseAVX > 2 && !VM_Version::supports_avx512dq() && (dst->encoding() == src->encoding())) {
>      Assembler::vpxor(dst, dst, src, Assembler::AVX_512bit);
>    } else {
>      Assembler::xorps(dst, src);
>    }
> }

Should supports_avx512vldq() be used instead? Or does the presence of DQ mean VL is also available?
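
The difference between the two checks can be sketched as a pair of pure functions. This is only a model of the dispatch condition in the quoted snippet, not HotSpot code: XorpsPath and both choose_* functions are hypothetical names, and supports_avx512vldq() is assumed here to mean VL and DQ together. Whether a CPU with DQ but without VL exists in practice is left open.

```cpp
// Which branch of MacroAssembler::xorps would be taken.
enum class XorpsPath { AssemblerXorps, Vpxor512 };

// Models the condition as quoted above: only DQ is consulted.
inline XorpsPath choose_with_dq(int use_avx, bool has_dq,
                                int dst_enc, int src_enc) {
    if (use_avx > 2 && !has_dq && dst_enc == src_enc) {
        return XorpsPath::Vpxor512;        // 512-bit vpxor fallback
    }
    return XorpsPath::AssemblerXorps;      // Assembler::xorps(dst, src)
}

// Models the alternative raised in the question: require VL and DQ.
// Assumption: supports_avx512vldq() is equivalent to (VL && DQ).
inline XorpsPath choose_with_vldq(int use_avx, bool has_vl, bool has_dq,
                                  int dst_enc, int src_enc) {
    const bool has_vldq = has_vl && has_dq;
    if (use_avx > 2 && !has_vldq && dst_enc == src_enc) {
        return XorpsPath::Vpxor512;
    }
    return XorpsPath::AssemblerXorps;
}
```

In this model the two variants disagree in exactly one configuration: UseAVX > 2 with DQ present and VL absent, where the DQ-only check takes the Assembler::xorps path and the VL-and-DQ check takes the vpxor fallback.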

> As for the second point, about MoveVecX2Leg: you may be right, and it could be an opportunity for improvement.

Good to know. FTR I'm fine with exploring it separately.

Best regards,
Vladimir Ivanov

> -----Original Message-----
> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
> Sent: Monday, October 7, 2019 10:18 AM
> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 
> 'hotspot-compiler-dev at openjdk.java.net' 
> <hotspot-compiler-dev at openjdk.java.net>
> Cc: 'Robert Strout' <robert.strout at oracle.com>; 'Scott Oaks' 
> <scott.oaks at oracle.com>
> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on 
> Skylake (X7) processors
> 
> Vivek,
> 
> Thinking more about it, I got a question:
> 
> src/hotspot/cpu/x86/x86.ad:
> 
>    instruct Repl4F_zero(vecX dst, immF0 zero) %{
> -  predicate(n->as_Vector()->length() == 4 && UseAVX < 3);
> +  predicate(n->as_Vector()->length() == 4);
>      match(Set dst (ReplicateF zero));
>      format %{ "xorps   $dst,$dst\t! replicate4F zero" %}
>      ins_encode %{
>        __ xorps($dst$$XMMRegister, $dst$$XMMRegister);
>      %}
>      ins_pipe( fpu_reg_reg );
>    %}
> 
> -instruct Repl4F_zero_evex(vecX dst, immF0 zero) %{
> -  predicate(n->as_Vector()->length() == 4 && UseAVX > 2);
> -  match(Set dst (ReplicateF zero));
> -  format %{ "vpxor  $dst k0,$dst,$dst\t! replicate4F zero" %}
> -  ins_encode %{
> -    // Use vpxor in place of vxorps since EVEX has a constriant on dq for vxorps: this is a 512-bit operation
> -    int vector_len = 2;
> -    __ vpxor($dst$$XMMRegister,$dst$$XMMRegister, $dst$$XMMRegister, vector_len);
> -  %}
> -  ins_pipe( fpu_reg_reg );
> -%}
> 
> Are there any issues with vecX when encoding xorps with dst in the upper half
> (xmm16-31) without avx512vl support? Should it be changed to legVecX?
> 
> 
> 
> Also:
> 
> // Load vectors (16 bytes long)
> instruct MoveVecX2Leg(legVecX dst, vecX src) %{
>     match(Set dst src);
>     format %{ "movdqu $dst,$src\t! load vector (16 bytes)" %}
>     ins_encode %{
>       if (UseAVX > 2 && !VM_Version::supports_avx512vl()) {
>         int vector_len = 2;
>         __ evmovdquq($dst$$XMMRegister, $src$$XMMRegister, vector_len);
>       } else {
>         __ movdqu($dst$$XMMRegister, $src$$XMMRegister);
>       }
>     %}
>     ins_pipe( fpu_reg_reg );
> %}
> 
> From a performance perspective, does it make sense to further
> limit the EVEX-encoded case when src is in the lower half of the range (0-15)?
> 
> 
> Best regards,
> Vladimir Ivanov
> 
> On 07/10/2019 14:03, Vladimir Ivanov wrote:
>>
>>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.01/
>>
>> Looks good.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>> -----Original Message-----
>>> From: Deshpande, Vivek R
>>> Sent: Thursday, September 26, 2019 10:52 AM
>>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; 
>>> hotspot-compiler-dev at openjdk.java.net
>>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole 
>>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>>> Subject: RE: RFR: 8221092: UseAVX=3 has performance degredation on 
>>> Skylake (X7) processors
>>>
>>> Thank you, Vladimir, for the review.
>>> I will work on adding the comments and listing the changes in the bug report.
>>>
>>> Yes, the threshold is for architectures after Skylake that support
>>> AVX512.
>>> With this threshold (4096 bytes, found experimentally),
>>> AVX512 is used only if the array size is larger than the threshold.
>>>
>>> Regards,
>>> Vivek
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Wednesday, September 25, 2019 12:41 PM
>>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 
>>> hotspot-compiler-dev at openjdk.java.net
>>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole 
>>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>>> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on 
>>> Skylake (X7) processors
>>>
>>> Thank you, Vivek
>>>
>>> I see you made several changes, including to the intrinsics code. It would be
>>> nice if you listed the changes in the bug report. I see you removed
>>> the _evex instruction variants in the .ad file, replaced EVEX instructions
>>> in stubs and set UseAVX to 2 for Skylake. That is easy to understand.
>>>
>>> But what about the array limit AVX3Threshold? I assume it is for
>>> non-Skylake CPUs with AVX512. Right?
>>> What is the number 4096 based on? It seems AVX3Threshold == 0 has
>>> special meaning - please add a line in globals_x86.hpp explaining it. I would
>>> need more time to look at the related changes.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/3/19 5:02 PM, Deshpande, Vivek R wrote:
>>>> Hi All
>>>>
>>>> I have created a patch that sets AVX2 for Skylake and selectively
>>>> uses EVEX instructions, based on a threshold, for platforms after Skylake.
>>>> I no longer observe the regressions for SPECjvm2008 on Skylake with
>>>> this patch.
>>>> JBS id:
>>>> https://bugs.openjdk.java.net/browse/JDK-8221092
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.00/
>>>> Could you all please review the patch?
>>>>
>>>> Regards,
>>>> Vivek
>>>>