RFR (M) 8207746:C2: Lucene crashes on AVX512 instruction

Fri Sep 14 19:12:45 UTC 2018

I got build failure:

workspace/open/src/hotspot/cpu/x86/c1_FrameMap_x86.cpp:236:3: error: array index 16 is past the end 
of the array (which contains 16 elements) [-Werror,-Warray-bounds]
jib >   _xmm_regs[16]  = xmm16;

I also noticed that we don't have RFE for this work. I filed:

https://bugs.openjdk.java.net/browse/JDK-8209735

You did not enabled avx512 by default (8209735 change was synced from jdk 11 into 12 2 weeks ago). I 
added next change to yours in src/hotspot/cpu/x86/globals_x86.hpp:

- product(intx, UseAVX, 2, \
+ product(intx, UseAVX, 3, \

Thanks,
Vladimir

On 9/14/18 11:39 AM, Vladimir Kozlov wrote:
> Looks good to me. I will start testing and let you know results.
> 
> Thanks,
> Vladimir
> 
> On 9/13/18 6:05 AM, Viswanathan, Sandhya wrote:
>> Hi Vladimir,
>>
>> Please find below the updated webrev with all your comments incorporated:
>>
>> http://cr.openjdk.java.net/~vdeshpande/xmm_reg/webrev.01/
>>
>> I have run the jtreg compiler tests on SKX and KNL which have two different flavors of AVX512 and 
>> Haswell a non-avx512 system. Also tested SPECjvm2008 on the three platforms.
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Tuesday, September 11, 2018 8:54 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR (M) 8207746:C2: Lucene crashes on AVX512 instruction
>>
>> Thank you, Sandhya
>>
>> I am satisfied with your detailed answer for memory loads issues. Okay lets not add them.
>>
>> Vladimir
>>
>> On 9/11/18 6:13 PM, Viswanathan, Sandhya wrote:
>>> Hi Vladimir,
>>>
>>> Thanks a lot for the detailed review. I really appreciate your feedback.
>>> Please see my response in your email below marked with (Sandhya >>>). Looking forward to your 
>>> advice.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Tuesday, September 11, 2018 5:11 PM
>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR (M) 8207746:C2: Lucene crashes on AVX512 instruction
>>>
>>> Thank you.
>>>
>>> I want to discuss next issue:
>>>
>>>    > You did not added instructions to load these registers from memory (and stack). What happens 
>>> in such cases when you need to load or store?
>>>    >>>> Let us take an example, e.g. for loading into rregF. First it gets loaded from memory 
>>> into regF and then register to register move to rregF and vice versa.
>>>
>>> This is what I thought. You increase registers pressure this way which may cause spills on stack.
>>> Also we don't check that register could be the same as result you may get unneeded moves.
>>>
>>> I would advice add memory moves at least.
>>>
>>> Sandhya >>>  I had added those rules initially and removed them in the final patch. I noticed 
>>> that the register allocator uses the memory rules (e.g. LoadF) to initialize the idealreg2reg 
>>> mask (matcher.cpp). I would like the register allocator to get all the possible register on an 
>>> architecture for idealreg2reg mask. I wondered that multiple instruct rules in .ad file for LoadF 
>>> from memory might cause problems.  I would have to have higher cost for loading into restricted 
>>> register set like vlReg. Then I decided that the register allocator can handle this in much 
>>> better way than me adding rules to load from memory. This is with the background that the regF is 
>>> always all the available registers and vlRegF is the restricted register set. Likewise for VecS 
>>> and legVecS. Let me know you thoughts on this and if I should still add the rules to load from 
>>> memory into vlReg and legVec. The specific code from matcher.cpp that I am referring to is:
>>>     MachNode *spillCP = match_tree(new 
>>> LoadNNode(NULL,mem,fp,atp,TypeInstPtr::BOTTOM,MemNode::unordered));
>>> #endif
>>>     MachNode *spillI  = match_tree(new LoadINode(NULL,mem,fp,atp,TypeInt::INT,MemNode::unordered));
>>>     MachNode *spillL  = match_tree(new 
>>> LoadLNode(NULL,mem,fp,atp,TypeLong::LONG,MemNode::unordered, LoadNode::DependsO
>>> nlyOnTest, false));
>>>     MachNode *spillF  = match_tree(new LoadFNode(NULL,mem,fp,atp,Type::FLOAT,MemNode::unordered));
>>>     MachNode *spillD  = match_tree(new LoadDNode(NULL,mem,fp,atp,Type::DOUBLE,MemNode::unordered));
>>>     MachNode *spillP  = match_tree(new 
>>> LoadPNode(NULL,mem,fp,atp,TypeInstPtr::BOTTOM,MemNode::unordered));
>>>     ....
>>>     idealreg2regmask[Op_RegF] = &spillF->out_RegMask();
>>>
>>> An other question. You use movflt() and movdbl() which use either movap[s|d] and movs[s|d] 
>>> instructions:
>>> http://hg.openjdk.java.net/jdk/jdk/file/4ffb0a33f265/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l164
>>> Are these instructions work when avx512vl is not available? I see for vectors you use
>>> vpxor+vinserti* combination.
>>>
>>> Sandhya >>> Yes the scalar floating point instructions are available with AVX512 encoding when 
>>> avx512vl is not available. That is why you would see not just movflt, movdbl but all the other 
>>> scalar operations like adds, addsd etc using the entire xmm range (xmm0-31). In other words they 
>>> are AVX512F instructions.
>>>
>>> Last question. I notice next UseAVX check in vectors spills code in x86.ad:
>>> if ((UseAVX < 2) || VM_Version::supports_avx512vl())
>>>
>>> Should it be (UseAVX < 3)?
>>>
>>> Sandhya >>> Yes, that is a very good catch. I will fix this in the updated webrev.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/11/18 2:58 PM, Viswanathan, Sandhya wrote:
>>>> Hi Vladimir,
>>>>
>>>> Thanks a lot for your review and feedback. Please see my response in your email below. I will 
>>>> send an updated webrev incorporating your feedback.
>>>>
>>>> Best Regards,
>>>> Sandhya
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Monday, September 10, 2018 6:09 PM
>>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>> hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR (M) 8207746:C2: Lucene crashes on AVX512 instruction
>>>>
>>>> Very nice. Thank you, Sandhya.
>>>>
>>>> I would like to see more meaningful naming in .ad files - instead of rreg* and ovec* to have 
>>>> something like vlReg* and legVec*.
>>>>
>>>>>>> Yes, accepted.
>>>>
>>>> New load_from_* and load_to_* instructions in .ad files should be renamed to next and collocate 
>>>> with other Move*_reg_reg* instructions:
>>>>
>>>> instruct MoveF2VL(vlRegF dst, regF src) instruct MoveVL2F(regF dst,
>>>> vlRegF src)
>>>>>>> Yes, accepted.
>>>>
>>>> You did not added instructions to load these registers from memory (and stack). What happens in 
>>>> such cases when you need to load or store?
>>>>>>> Let us take an example, e.g. for loading into rregF. First it gets loaded from memory into 
>>>>>>> regF and then register to register move to rregF and vice versa.
>>>>
>>>> Also please explain why these registers are used when UseAVX == 0?:
>>>>
>>>> +instruct absD_reg(rregD dst) %{
>>>>        predicate((UseSSE>=2) && (UseAVX == 0));
>>>>
>>>> we switch off evex so regular regD (only legacy register in this case) should work too:
>>>>      661   if (UseAVX < 3) {
>>>>      662     _features &= ~CPU_AVX512F;
>>>>
>>>>>>> Yes, accepted. It could be regD here.
>>>>
>>>> Next checks could be combined by using new function in vm_version_x86.hpp (you already have some):
>>>>
>>>> +reg_class_dynamic vectors_reg_bwdqvl(vectors_reg_evex,
>>>> +vectors_reg_legacy, %{
>>>> VM_Version::supports_evex() && VM_Version::supports_avx512bw() &&
>>>> VM_Version::supports_avx512dq() &&
>>>> VM_Version::supports_avx512vl() %} );
>>>>
>>>>>>> Yes, accepted.
>>>>
>>>> I would suggest to test these changes on different machines (non-avx512 and avx512) and with 
>>>> different UseAVX values.
>>>>
>>>>>>> Will do.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/5/18 4:09 PM, Viswanathan, Sandhya wrote:
>>>>> Recently there have been couple of high priority issues with regards
>>>>> to high bank of XMM register
>>>>> (XMM16-XMM31) usage by C2:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8207746
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8209735
>>>>>
>>>>> Please find below a patch which attempts to clean up the XMM register handling by using 
>>>>> register groups.
>>>>>
>>>>> http://cr.openjdk.java.net/~vdeshpande/xmm_reg/webrev.00/
>>>>> <http://cr.openjdk.java.net/%7Evdeshpande/xmm_reg/webrev.00/>
>>>>>
>>>>> The patch provides a restricted set of registers to the match rules
>>>>> in the ad file based on the underlying architecture.
>>>>>
>>>>> The aim is to remove special handling/workaround from macro assembler and assembler.
>>>>>
>>>>> By removing the special handling, the patch reduces the overall code size by about 1800 lines 
>>>>> of code.
>>>>>
>>>>> Your review and feedback is very welcome.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Sandhya
>>>>>