RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86
Deshpande, Vivek R
vivek.r.deshpande at intel.com
Tue May 2 19:26:55 UTC 2017
Hi Vladimir
I will check and get back to you soon.
Thanks.
Regards,
Vivek
-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
Sent: Tuesday, May 2, 2017 11:20 AM
To: Deshpande, Vivek R; hotspot compiler
Cc: Viswanathan, Sandhya
Subject: Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86
The build failed:
# Internal Error (/scratch/opt/jprt/T/P1/175733.vkozlov/s/hotspot/src/cpu/x86/vm/vm_version_x86.hpp:640), pid=7347, tid=7348
# assert(_cpuid_info.std_cpuid1_eax.bits.family != 0) failed: VM_Version not initialized
V [libjvm.so+0x93a6a0] report_vm_error(char const*, int, char const*, char const*, ...)+0x60
V [libjvm.so+0x14cb1f0] VM_Version_StubGenerator::generate_get_cpu_info()+0x2f20
V [libjvm.so+0x14c7f53] VM_Version::initialize()+0x103
I think it fails at the is_intel() check in set_avx_cpuFeatures() or set_evex_cpuFeatures().
Please build a fastdebug JVM for testing.
Vladimir
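
A minimal sketch of the kind of ordering problem suspected above, assuming the assert guards queries made before the CPUID data has been read. CpuInfoModel, VersionModel and load_cpuid() are hypothetical names for illustration only and are not HotSpot code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for cached CPUID data.
struct CpuInfoModel {
    uint32_t family = 0;          // 0 means "CPUID has not been read yet"
    bool vendor_is_intel = false;
};

class VersionModel {
public:
    bool initialized() const { return info_.family != 0; }

    // Same shape as the failing assert: a vendor query made before the
    // CPUID data is populated aborts when asserts are compiled in.
    bool is_intel() const {
        assert(initialized() && "VM_Version not initialized");
        return info_.vendor_is_intel;
    }

    // Stand-in for the step that fills in the CPUID data.
    void load_cpuid(uint32_t family, bool intel) {
        info_.family = family;
        info_.vendor_is_intel = intel;
    }

private:
    CpuInfoModel info_;
};
```

In this model, calling is_intel() before load_cpuid() aborts only when asserts are enabled, which is why a build with asserts, such as fastdebug, exposes the problem.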
On 5/2/17 10:59 AM, Vladimir Kozlov wrote:
> Looks good. One nit: move the following code into a separate function instead of duplicating it 3 times:
>
> + if( is_intel() ) { // Intel cpus specific settings
> + if ((cpu_family() == 0x06) &&
> + ((extended_cpu_model() == 0x57) || // Xeon Phi 3200/5200/7200
> + (extended_cpu_model() == 0x85))) { // Future Xeon Phi
> + _features &= ~CPU_VZEROUPPER;
> + }
> + }
>
> I will run testing for current changes and let you know results.
>
> Thanks,
> Vladimir
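
One possible shape for the helper suggested in this nit, as a self-contained sketch. The family and model constants come from the quoted diff; CpuModel, is_knights_family(), clear_vzeroupper_on_knights() and the bit positions are hypothetical and are not taken from the webrev:

```cpp
#include <cstdint>

// Hypothetical bit positions, for illustration only.
constexpr uint64_t CPU_AVX        = 1ULL << 0;
constexpr uint64_t CPU_VZEROUPPER = 1ULL << 1;

// Hypothetical stand-in for the values returned by is_intel(),
// cpu_family() and extended_cpu_model().
struct CpuModel {
    bool     intel;
    uint32_t family;
    uint32_t extended_model;
};

// True for the Xeon Phi models named in the quoted diff.
bool is_knights_family(const CpuModel& cpu) {
    return cpu.intel && cpu.family == 0x06 &&
           (cpu.extended_model == 0x57 ||   // Xeon Phi 3200/5200/7200
            cpu.extended_model == 0x85);    // Future Xeon Phi
}

// Single place that clears the vzeroupper feature bit on Knights CPUs,
// so the three call sites do not each repeat the condition.
uint64_t clear_vzeroupper_on_knights(uint64_t features, const CpuModel& cpu) {
    if (is_knights_family(cpu)) {
        features &= ~CPU_VZEROUPPER;
    }
    return features;
}
```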
>
> On 4/20/17 1:54 PM, Deshpande, Vivek R wrote:
>> Hi Vladimir
>>
>> We added almost all the vzeroupper instructions by analyzing SPECjbb2015.
>>
>> We need vzeroupper after we execute AVX2/AVX-512 code and before the
>> transition to SSE, to avoid the penalty of saving and restoring the upper bits of the YMM/ZMM registers. Since the JVM itself is compiled for SSE, AVX and SSE code get mixed in C1 and the interpreter through the use of intrinsics, so I have added vzeroupper at these transitions in the stubs.
>>
>> I think max_vector_size() > 16 would only be the case with auto-vectorization, but we also need vzeroupper in stubs and intrinsics.
>>
>> I have made all the changes which you suggested. Please take a look at the patch and let me know your thoughts.
>>
>> The updated Webrev is here:
>> http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.01/
>>
>> Thank you.
>> Regards,
>> Vivek
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Friday, April 14, 2017 12:48 PM
>> To: Deshpande, Vivek R; hotspot compiler
>> Cc: Viswanathan, Sandhya
>> Subject: Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition
>> penalty on x86
>>
>> Hi Vivek,
>>
>> Did you pinpoint the particular change that helps SPECjbb2015?
>>
>> From what I remember, it is mostly during calls to the runtime or JNI calls.
>>
>> // AVX instruction which is used to clear upper 128 bits of YMM registers and
>> // to avoid transaction penalty between AVX and SSE states. There is no
>> // penalty if legacy SSE instructions are encoded using VEX prefix because
>> // they always clear upper 128 bits. It should be used before calling
>> // runtime code and native libraries.
>> void vzeroupper();
>>
>> So why did you add vzeroupper() to arraycopy and other stubs? If AVX is supported, all instructions should be encoded with the VEX prefix. Or is the statement in the comment incorrect?
>>
>> About the UseVzeroupper flag: we should avoid adding new product flags if possible.
>> It only makes sense if you want to run experiments to check performance with and without it. But we already know that it is needed on most CPUs.
>>
>> I would suggest adding VM_Version::supports_vzeroupper() { return (_features & CPU_VZEROUPPER) != 0; }
>> Set CPU_VZEROUPPER if AVX is supported and clear it for Knights CPUs. Then add a check inside the assembler instruction:
>>
>> void Assembler::vzeroupper() {
>> - assert(VM_Version::supports_avx(), "");
>> + if (VM_Version::supports_vzeroupper()) {
>>
>> This avoids the need for a supports_avx() check at each vzeroupper() call.
>>
>> You missed the check in MachEpilogNode::emit() in x86_64.ad.
>>
>> Note, I used (C->max_vector_size() > 16) checks in .ad files for the same reason as above:
>>
>> // There is no penalty if legacy SSE instructions are encoded using VEX prefix because
>> // they always clear upper 128 bits.
>>
>> I thought that if vectors are small and we use the VEX prefix, then the upper bits will be 0 anyway, so you don't need vzeroupper().
>>
>> Thanks,
>> Vladimir
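
A self-contained sketch of the gating suggested above: the feature bit is tested once inside the emitter, so callers do not need their own supports_avx() check. VersionModel, AssemblerModel and the bit position are hypothetical; the bytes C5 F8 77 are the two-byte VEX encoding of VZEROUPPER:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bit position, for illustration only.
constexpr uint64_t CPU_VZEROUPPER = 1ULL << 0;

// Hypothetical stand-in for the feature query.
struct VersionModel {
    uint64_t features;
    bool supports_vzeroupper() const {
        return (features & CPU_VZEROUPPER) != 0;
    }
};

// Hypothetical stand-in for an assembler writing into a code buffer.
class AssemblerModel {
public:
    explicit AssemblerModel(const VersionModel& v) : version_(v) {}

    // Emits VZEROUPPER only when the feature bit is set; otherwise this
    // is a no-op, so call sites can invoke it unconditionally.
    void vzeroupper() {
        if (version_.supports_vzeroupper()) {
            code_.push_back(0xC5);
            code_.push_back(0xF8);
            code_.push_back(0x77);
        }
    }

    const std::vector<uint8_t>& code() const { return code_; }

private:
    VersionModel version_;
    std::vector<uint8_t> code_;
};
```

With the bit cleared, as proposed for Knights CPUs, every vzeroupper() call site emits nothing without further changes at the callers.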
>>
>> On 4/14/17 11:17 AM, Deshpande, Vivek R wrote:
>>> Hi All
>>>
>>>
>>>
>>> This fix minimizes the AVX to SSE and SSE to AVX transition penalty
>>> through generation of the vzeroupper instruction. With this patch we see
>>> zero transitions with penalty per SPECjbb2015 jOPS on BDW, and on SKX a
>>> significant reduction in the vector width mismatch CPU event, from 65
>>> to 0.01 per SPECjbb2015 jOPS.
>>>
>>> We have also implemented an enhancement to disable vzeroupper generation for the Knights family, where the instruction has a high penalty and is not recommended. The option UseVzeroupper controls generation of the vzeroupper instruction and is set to false on the Knights family.
>>>
>>> We observed a ~3% gain on the SPECjvm2008 composite result on Skylake.
>>>
>>> Webrev:
>>>
>>> http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.00/
>>>
>>> I have also updated the JBS entry.
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8178811
>>>
>>> Would you please review and sponsor it?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Vivek
>>>
>>>
>>>