RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors

Tue Oct 8 18:37:12 UTC 2019

> The declaration of the vecX operand type includes the upper register bank only on architectures that support avx512vl, avx512bw and avx512dq:

Ah, now I got it. Thanks!

Best regards,
Vladimir Ivanov

> 
> reg_class_dynamic vectorx_reg(vectorx_reg_evex, vectorx_reg_legacy, %{ VM_Version::supports_evex() %} );
> reg_class_dynamic vectorx_reg_vlbwdq(vectorx_reg_evex, vectorx_reg_legacy, %{ VM_Version::supports_avx512vlbwdq() %} );
> 
> operand vecX() %{
>    constraint(ALLOC_IN_RC(vectorx_reg_vlbwdq));
>    match(VecX);
> 
>    format %{ %}
>    interface(REG_INTER);
> %}
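A minimal C++ sketch of the selection that the two reg_class_dynamic declarations above express, assuming x86_64 with xmm0-15 in the legacy class and xmm0-31 in the EVEX class. CpuFeatures and max_vecx_register are hypothetical names used for illustration only; they are not HotSpot APIs.

```cpp
#include <cassert>

// Hypothetical stand-in for the feature queries in VM_Version.
struct CpuFeatures {
    bool evex;      // EVEX encoding (AVX-512 foundation) available
    bool avx512vl;
    bool avx512bw;
    bool avx512dq;

    // Assumption: the combined check requires all four features.
    bool supports_avx512vlbwdq() const {
        return evex && avx512vl && avx512bw && avx512dq;
    }
};

constexpr int kLegacyMaxXmm = 15;  // xmm0-15, reachable without EVEX
constexpr int kEvexMaxXmm   = 31;  // xmm0-31, upper bank needs EVEX

// Models reg_class_dynamic: the first class is chosen when the predicate
// holds, otherwise the second (legacy) class.
int max_vecx_register(const CpuFeatures& cpu) {
    return cpu.supports_avx512vlbwdq() ? kEvexMaxXmm : kLegacyMaxXmm;
}

void check_examples() {
    // All of VL, BW and DQ present: upper bank is allowed.
    assert(max_vecx_register(CpuFeatures{true, true, true, true}) == 31);
    // EVEX without VL: vecX stays within the legacy bank.
    assert(max_vecx_register(CpuFeatures{true, false, true, true}) == 15);
}
```

Under this model a vecX operand can land in xmm16-31 only when VL, BW and DQ are all present, which is the point made in the reply.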
> 
> Best Regards,
> Sandhya
> 
> 
> 
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Tuesday, October 08, 2019 11:10 AM
> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Cc: 'Robert Strout' <robert.strout at oracle.com>; 'Scott Oaks' <scott.oaks at oracle.com>; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on Skylake (X7) processors
> 
> Thanks for the clarifications, Vivek.
> 
>> VecX includes the higher bank of registers (xmm16-31) only when avx512vl, avx512bw and avx512dq are all supported.
>> VecX works for Repl4F_zero because xorps calls vpxor when avx512vl, bw and dq are not available.
>>
>> This is the snippet from MacroAssembler which is used here:
>> void MacroAssembler::xorps(XMMRegister dst, XMMRegister src) {
>>     if (UseAVX > 2 && !VM_Version::supports_avx512dq() && (dst->encoding() == src->encoding())) {
>>       Assembler::vpxor(dst, dst, src, Assembler::AVX_512bit);
>>     } else {
>>       Assembler::xorps(dst, src);
>>     }
>> }
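The dispatch in the snippet above can be modelled without the assembler. This is only a sketch: Encoding and select_xorps_encoding are hypothetical names, and the real code emits instructions rather than returning a value.

```cpp
#include <cassert>

// Hypothetical labels for the two paths in MacroAssembler::xorps.
enum class Encoding { Xorps, Vpxor512 };

// Same condition as the quoted snippet: fall back to a 512-bit vpxor only
// when UseAVX > 2, AVX512DQ is missing, and dst and src are the same register.
Encoding select_xorps_encoding(int use_avx, bool has_avx512dq,
                               int dst_encoding, int src_encoding) {
    if (use_avx > 2 && !has_avx512dq && dst_encoding == src_encoding) {
        return Encoding::Vpxor512;
    }
    return Encoding::Xorps;
}

void check_examples() {
    // Zeroing idiom (dst == src) on an AVX-512 CPU without DQ.
    assert(select_xorps_encoding(3, false, 5, 5) == Encoding::Vpxor512);
    // With DQ available, plain xorps is used.
    assert(select_xorps_encoding(3, true, 5, 5) == Encoding::Xorps);
}
```

Note that the condition tests only AVX512DQ, not AVX512VL, which is what the follow-up question is about.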
> 
> Should supports_avx512vldq() be used instead? Or does the presence of DQ mean VL is also available?
> 
>> On the second point, about MoveVecX2Leg: you may be right, and it could be an opportunity for improvement.
> 
> Good to know. FTR I'm fine with exploring it separately.
> 
> Best regards,
> Vladimir Ivanov
> 
>> -----Original Message-----
>> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
>> Sent: Monday, October 7, 2019 10:18 AM
>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>;
>> 'hotspot-compiler-dev at openjdk.java.net'
>> <hotspot-compiler-dev at openjdk.java.net>
>> Cc: 'Robert Strout' <robert.strout at oracle.com>; 'Scott Oaks'
>> <scott.oaks at oracle.com>
>> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on
>> Skylake (X7) processors
>>
>> Vivek,
>>
>> Thinking more about it, I got a question:
>>
>> src/hotspot/cpu/x86/x86.ad:
>>
>>     instruct Repl4F_zero(vecX dst, immF0 zero) %{
>> -  predicate(n->as_Vector()->length() == 4 && UseAVX < 3);
>> +  predicate(n->as_Vector()->length() == 4);
>>       match(Set dst (ReplicateF zero));
>>       format %{ "xorps   $dst,$dst\t! replicate4F zero" %}
>>       ins_encode %{
>>         __ xorps($dst$$XMMRegister, $dst$$XMMRegister);
>>       %}
>>       ins_pipe( fpu_reg_reg );
>>     %}
>>
>> -instruct Repl4F_zero_evex(vecX dst, immF0 zero) %{
>> -  predicate(n->as_Vector()->length() == 4 && UseAVX > 2);
>> -  match(Set dst (ReplicateF zero));
>> -  format %{ "vpxor  $dst k0,$dst,$dst\t! replicate4F zero" %}
>> -  ins_encode %{
>> -    // Use vpxor in place of vxorps since EVEX has a constraint on dq for vxorps: this is a 512-bit operation
>> -    int vector_len = 2;
>> -    __ vpxor($dst$$XMMRegister,$dst$$XMMRegister, $dst$$XMMRegister, vector_len);
>> -  %}
>> -  ins_pipe( fpu_reg_reg );
>> -%}
>>
>> Are there any issues with vecX when encoding xorps with dst in the higher
>> half (xmm16-31) without avx512vl support? Should it be turned into legVecX?
>>
>>
>>
>> Also:
>>
>> // Load vectors (16 bytes long)
>> instruct MoveVecX2Leg(legVecX dst, vecX src) %{
>>      match(Set dst src);
>>      format %{ "movdqu $dst,$src\t! load vector (16 bytes)" %}
>>      ins_encode %{
>>        if (UseAVX > 2 && !VM_Version::supports_avx512vl()) {
>>          int vector_len = 2;
>>          __ evmovdquq($dst$$XMMRegister, $src$$XMMRegister, vector_len);
>>        } else {
>>          __ movdqu($dst$$XMMRegister, $src$$XMMRegister);
>>        }
>>      %}
>>      ins_pipe( fpu_reg_reg );
>> %}
>>
>> From a performance perspective, does it make any sense to further
>> limit the EVEX-encoded case when src is in the lower half of the range (0-15)?
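One way to read the question above, sketched in C++. The first function mirrors the condition quoted from MoveVecX2Leg. The second is a hypothetical refinement that is not part of the patch; it assumes the intent is to skip the EVEX form when src is already in xmm0-15. Both function names are illustrative.

```cpp
#include <cassert>

// xmm16-31 can only be addressed with an EVEX encoding.
constexpr int kFirstUpperBankEncoding = 16;

// Condition as quoted in MoveVecX2Leg: use the 512-bit evmovdquq whenever
// AVX-512 is enabled but AVX512VL is not available.
bool uses_evex_move(int use_avx, bool has_avx512vl) {
    return use_avx > 2 && !has_avx512vl;
}

// Hypothetical refinement (an assumption, not the reviewed code): keep the
// EVEX form only when src actually lives in the upper bank.
bool uses_evex_move_refined(int use_avx, bool has_avx512vl, int src_encoding) {
    return uses_evex_move(use_avx, has_avx512vl) &&
           src_encoding >= kFirstUpperBankEncoding;
}

void check_examples() {
    // Current behaviour: EVEX regardless of where src lives.
    assert(uses_evex_move(3, false));
    // Refined: src in xmm3 would take the movdqu path instead.
    assert(!uses_evex_move_refined(3, false, 3));
    // Refined: src in xmm20 still needs EVEX.
    assert(uses_evex_move_refined(3, false, 20));
}
```

Whether the narrower condition pays off is exactly the open performance question; the sketch only makes the two conditions concrete.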
>>
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 07/10/2019 14:03, Vladimir Ivanov wrote:
>>>
>>>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.01/
>>>
>>> Looks good.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>> -----Original Message-----
>>>> From: Deshpande, Vivek R
>>>> Sent: Thursday, September 26, 2019 10:52 AM
>>>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>;
>>>> hotspot-compiler-dev at openjdk.java.net
>>>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole
>>>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>>>> Subject: RE: RFR: 8221092: UseAVX=3 has performance degredation on
>>>> Skylake (X7) processors
>>>>
>>>> Thank you, Vladimir, for the review.
>>>> I will work on adding the comments and the changes to the bug report.
>>>>
>>>> Yes, the threshold is for architectures after Skylake which support
>>>> AVX512.
>>>> With this threshold (value = 4096 bytes, found experimentally),
>>>> AVX512 will be used only if the array size is bigger than the threshold.
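The threshold rule described above, as a small C++ sketch. The 4096-byte value comes from the thread; the function name, the strict "greater than" comparison (taken from "bigger than"), and the UseAVX gate are assumptions. The special meaning of AVX3Threshold == 0 mentioned in the review is not modelled here.

```cpp
#include <cassert>
#include <cstddef>

// Threshold quoted in the thread, found experimentally by the patch author.
constexpr std::size_t kAvx3ThresholdBytes = 4096;

// Hypothetical helper: pick the 512-bit path only when AVX-512 is enabled
// and the array is larger than the threshold.
bool use_avx512_for_array(int use_avx, std::size_t array_bytes,
                          std::size_t threshold = kAvx3ThresholdBytes) {
    return use_avx > 2 && array_bytes > threshold;
}

void check_examples() {
    // Exactly at the threshold: stay on the narrower path.
    assert(!use_avx512_for_array(3, 4096));
    // One byte over: switch to AVX-512.
    assert(use_avx512_for_array(3, 4097));
    // UseAVX=2 (as set for Skylake by the patch) never takes the 512-bit path.
    assert(!use_avx512_for_array(2, 8192));
}
```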
>>>>
>>>> Regards,
>>>> Vivek
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Wednesday, September 25, 2019 12:41 PM
>>>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>;
>>>> hotspot-compiler-dev at openjdk.java.net
>>>> Cc: Scott Oaks <scott.oaks at oracle.com>; eric.caspole
>>>> <eric.caspole at oracle.com>; Robert Strout <robert.strout at oracle.com>
>>>> Subject: Re: RFR: 8221092: UseAVX=3 has performance degredation on
>>>> Skylake (X7) processors
>>>>
>>>> Thank you, Vivek
>>>>
>>>> I see you made several changes, including to the intrinsics code. It
>>>> would be nice if you listed the changes in the bug report. I see you
>>>> removed the _evex instruction variants in the .ad file, replaced EVEX
>>>> instructions in stubs and set UseAVX to 2 for Skylake. That is easy to
>>>> understand.
>>>>
>>>> But what about the array limit AVX3Threshold? I assume it is for
>>>> non-Skylake CPUs with AVX512. Right?
>>>> What is the number 4096 based on? It seems AVX3Threshold == 0 has a
>>>> special meaning; please add a line in globals_x86.hpp explaining it. I
>>>> will need more time to look at the related changes.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/3/19 5:02 PM, Deshpande, Vivek R wrote:
>>>>> Hi All
>>>>>
>>>>> I have created a patch which sets UseAVX=2 for Skylake and selectively
>>>>> uses EVEX instructions, based on a threshold, on platforms after Skylake.
>>>>> I don't observe regressions for SPECjvm2008 on Skylake with
>>>>> this patch.
>>>>> JBS id:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8221092
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~vdeshpande/8221092/webrev.00/
>>>>> Could you all please review the patch?
>>>>>
>>>>> Regards,
>>>>> Vivek
>>>>>