RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Tue May 2 18:19:35 UTC 2017

The build failed:

#  Internal Error (/scratch/opt/jprt/T/P1/175733.vkozlov/s/hotspot/src/cpu/x86/vm/vm_version_x86.hpp:640), pid=7347, tid=7348
#  assert(_cpuid_info.std_cpuid1_eax.bits.family != 0) failed: VM_Version not initialized

V  [libjvm.so+0x93a6a0]  report_vm_error(char const*, int, char const*, char const*, ...)+0x60
V  [libjvm.so+0x14cb1f0]  VM_Version_StubGenerator::generate_get_cpu_info()+0x2f20
V  [libjvm.so+0x14c7f53]  VM_Version::initialize()+0x103

I think it fails at the is_intel() check in set_avx_cpuFeatures() or set_evex_cpuFeatures().
Please build a fastdebug JVM for testing.
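
A minimal sketch of why a fastdebug build matters here, assuming the failure is the accessor being called before the cpuid data has been read. The type and function names below (CpuInfoSketch, is_intel_checked, is_intel_unchecked) are hypothetical stand-ins, not the real VM_Version code. A build with the assert stops at the first early call, while a build without it quietly answers "not Intel":

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the cpuid data held by VM_Version.
struct CpuInfoSketch {
  uint32_t family = 0;         // 0 means cpuid has not been read yet
  uint32_t vendor_name_0 = 0;  // first 4 bytes of the vendor string
};

// Debug-style accessor: refuses to answer before initialization.
// An exception stands in for the HotSpot assert in this sketch.
inline bool is_intel_checked(const CpuInfoSketch& info) {
  if (info.family == 0) {
    throw std::logic_error("VM_Version not initialized");
  }
  return info.vendor_name_0 == 0x756e6547;  // "Genu" in little-endian
}

// Product-style accessor: no check, so an early call just returns false
// and the CPU-specific branch is silently skipped.
inline bool is_intel_unchecked(const CpuInfoSketch& info) {
  return info.vendor_name_0 == 0x756e6547;
}
```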

Vladimir

On 5/2/17 10:59 AM, Vladimir Kozlov wrote:
> Looks good. One nit - move the following code into a separate function instead of duplicating it 3 times:
>
> +    if( is_intel() ) { // Intel cpus specific settings
> +      if ((cpu_family() == 0x06) &&
> +          ((extended_cpu_model() == 0x57) ||   // Xeon Phi 3200/5200/7200
> +          (extended_cpu_model() == 0x85))) {  // Future Xeon Phi
> +        _features &= ~CPU_VZEROUPPER;
> +      }
> +    }
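
One possible shape for such a helper, as a standalone sketch rather than the actual patch. The CpuId struct, the function names, and the CPU_VZEROUPPER bit position are assumptions made for illustration. Only the family and model values come from the quoted code:

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit position, for illustration only.
constexpr uint64_t CPU_VZEROUPPER = 1ULL << 40;

// Hypothetical bundle of the values the quoted check reads.
struct CpuId {
  bool     intel;
  uint32_t family;
  uint32_t extended_model;
};

// True for the Knights (Xeon Phi) models named in the quoted code.
inline bool is_knights_family(const CpuId& cpu) {
  return cpu.intel && cpu.family == 0x06 &&
         (cpu.extended_model == 0x57 ||   // Xeon Phi 3200/5200/7200
          cpu.extended_model == 0x85);    // Future Xeon Phi
}

// Single place that clears the vzeroupper feature bit on Knights CPUs;
// every other CPU keeps its feature mask unchanged.
inline uint64_t clear_vzeroupper_on_knights(uint64_t features, const CpuId& cpu) {
  return is_knights_family(cpu) ? (features & ~CPU_VZEROUPPER) : features;
}
```

Each of the three call sites would then reduce to one call to the helper.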
>
> I will run testing for current changes and let you know results.
>
> Thanks,
> Vladimir
>
> On 4/20/17 1:54 PM, Deshpande, Vivek R wrote:
>> Hi Vladimir
>>
>> We added almost all the vzeroupper instructions by analyzing SPECjbb2015.
>>
>> We need vzeroupper after executing AVX2/AVX-512 code and before transitioning to SSE, to avoid the penalty of saving and restoring the upper bits of the YMM/ZMM registers.
>> Since the JVM itself is compiled with SSE, AVX and SSE code get mixed in C1 and the interpreter through the use of intrinsics, so I have added vzeroupper at these transitions in the stubs.
>>
>> I think max_vector_size() > 16 would only occur with auto-vectorization, but we also need vzeroupper for stubs and intrinsics.
>>
>> I have made all the changes which you suggested. Please take a look at the patch and let me know your thoughts.
>>
>> The updated Webrev is here:
>> http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.01/
>>
>> Thank you.
>> Regards,
>> Vivek
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Friday, April 14, 2017 12:48 PM
>> To: Deshpande, Vivek R; hotspot compiler
>> Cc: Viswanathan, Sandhya
>> Subject: Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86
>>
>> Hi Vivek,
>>
>> Did you pinpoint particular change which helps SPECjbb2015?
>>
>> From what I remember, it matters mostly during calls to the runtime or JNI calls.
>>
>>    // AVX instruction which is used to clear upper 128 bits of YMM registers and
>>    // to avoid transaction penalty between AVX and SSE states. There is no
>>    // penalty if legacy SSE instructions are encoded using VEX prefix because
>>    // they always clear upper 128 bits. It should be used before calling
>>    // runtime code and native libraries.
>>    void vzeroupper();
>>
>> So why did you add vzeroupper() to arraycopy and other stubs? If AVX is supported, all instructions should be encoded with the VEX prefix. Or is the statement in the comment incorrect?
>>
>> About the UseVzeroupper flag: we should avoid adding new product flags if possible.
>> It only makes sense if you want to run experiments comparing performance with and without it. But we already know that it is needed on most CPUs.
>>
>> I would suggest adding VM_Version::supports_vzeroupper() { return (_features & CPU_VZEROUPPER) != 0; }
>> Set CPU_VZEROUPPER if AVX is supported and clear it for Knights CPUs. Then add a check inside the assembler instruction:
>>
>>    void Assembler::vzeroupper() {
>> -   assert(VM_Version::supports_avx(), "");
>> +   if (VM_Version::supports_vzeroupper()) {
>>
>> This avoids the need for supports_avx() checks at each vzeroupper() call.
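
A self-contained sketch of this suggestion, under stated assumptions. VMVersionSketch and AssemblerSketch are hypothetical stand-ins for VM_Version and Assembler, and the feature bit positions are made up. The feature bit is set when AVX is present, cleared on Knights, and the emitter itself decides whether to produce the instruction bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed bit positions, for illustration only.
constexpr uint64_t CPU_AVX        = 1ULL << 0;
constexpr uint64_t CPU_VZEROUPPER = 1ULL << 1;

// Hypothetical stand-in for the feature bookkeeping in VM_Version.
struct VMVersionSketch {
  uint64_t features = 0;

  bool supports_avx() const        { return (features & CPU_AVX) != 0; }
  bool supports_vzeroupper() const { return (features & CPU_VZEROUPPER) != 0; }

  // Set CPU_VZEROUPPER whenever AVX is supported, then clear it on Knights.
  void init(bool has_avx, bool is_knights) {
    features = 0;
    if (has_avx)    features |= CPU_AVX | CPU_VZEROUPPER;
    if (is_knights) features &= ~CPU_VZEROUPPER;
  }
};

// Hypothetical stand-in for the assembler: the guard lives inside the
// emitter, so callers can call vzeroupper() unconditionally.
struct AssemblerSketch {
  const VMVersionSketch& vm;
  std::vector<uint8_t> code;

  explicit AssemblerSketch(const VMVersionSketch& v) : vm(v) {}

  void vzeroupper() {
    if (vm.supports_vzeroupper()) {
      // vzeroupper is encoded as VEX.128.0F.WIG 77, i.e. bytes C5 F8 77.
      code.push_back(0xC5);
      code.push_back(0xF8);
      code.push_back(0x77);
    }
  }
};
```

With the guard inside the emitter, a call on a CPU without the feature bit emits nothing, so stubs do not need their own checks.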
>>
>> You missed the check in MachEpilogNode::emit() in x86_64.ad
>>
>> Note, I used (C->max_vector_size() > 16) checks in .ad files for the same reason as above:
>>
>>    // There is no penalty if legacy SSE instructions are encoded using VEX prefix because
>>    // they always clear upper 128 bits.
>>
>> I thought that if vectors are small and we use the VEX prefix, then the upper bits will be 0 anyway, so you don't need vzeroupper().
>>
>> Thanks,
>> Vladimir
>>
>> On 4/14/17 11:17 AM, Deshpande, Vivek R wrote:
>>> Hi All
>>>
>>>
>>>
>>> This fix minimizes the AVX-to-SSE and SSE-to-AVX transition penalty by
>>> generating vzeroupper instructions. With this patch we see zero
>>> transitions with penalty per SPECjbb2015 jOPS on BDW, and a significant
>>> reduction in the vector width mismatch CPU event on SKX, from 65 to 0.01
>>> per SPECjbb2015 jOPS. We have also implemented an enhancement to disable
>>> vzeroupper generation for the Knights family, where the instruction has a
>>> high penalty and is not recommended. The UseVzeroupper option controls
>>> generation of the vzeroupper instruction and is set to false on the
>>> Knights family.
>>> We observed a ~3% gain on the SPECjvm2008 composite result on Skylake.
>>>
>>> Webrev:
>>>
>>> http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.00/
>>>
>>> I have also updated the JBS entry.
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8178811
>>>
>>> Would you please review and sponsor it?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Vivek
>>>
>>>
>>>