RFR (M) 8222074: Enhance auto vectorization for x86

Wed Apr 10 01:19:10 UTC 2019

On 4/9/19 6:04 PM, Viswanathan, Sandhya wrote:
> Hi Vladimir,
> 
> Yes, I missed the question below:
>>> There are cases where we can use less `TEMP tmp` registers by using 'dst' register like in mul4B_reg(). Is it intentional to not use 'dst' there?
> 
> No it is not intentional, we can use the dst register in those cases and reduced the tmps.
> 
> Thanks a lot for clarifying on vec_mov_helper(), it is much clearer now what you have in mind. But the code generated many times differs with vector size so it may not help in reducing the libjvm code size. I will explore all possible ways to reduce code size increase.

You can "cheat" ;-) by optimizing existing code to get under size budget for new code. Do it as 
separate RFE.

Vladimir

> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Tuesday, April 09, 2019 5:58 PM
> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> On 4/9/19 4:59 PM, Viswanathan, Sandhya wrote:
>> Hi Vladimir,
>>
>> Please see my answers in your email below.
> 
> My comments below too.
> 
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Tuesday, April 09, 2019 12:34 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>
>> Hi Sandhya,
>>
>> I looked through changes and had discussion with Vladimir Ivanov about them.
>> In general logic of changes follow out usual pattern - no problem here.
>>
>> There are cases where we can use less `TEMP tmp` registers by using 'dst' register like in mul4B_reg(). Is it intentional to not use 'dst' there?
> 
> What about this question?
> 
>>
>> But my main concern now is JVM size significant grow. And it will be worse when we implementing the rest of Vector API instructions.
>>
>> The main reason for size grow is additional AD instructions which are compiled by APLC into multiply functions. I originally thought that moving 'ins_encode %{ %}' code which have several instructions into macro-assembler will help if it is used by several AD instructions. But Vladimir I. convinced me that it will be insignificant comparing to reducing number of AD instructions.
>>
>> We came up with several suggestion how we can address it and it will greatly help if you (Intel) investigate them.
>>
>> 1. I think you should provide JVM size increase data for changes like this. What is increase for this one?
>>
>> Sandhya >> The size increase is about 1.5%:
>>                         libjvm.so                            : 24358297
>>                         libjvm.so + 8222074        : 24718043 (1.5%)
> 
> I think it is not a little - you implemented only 4 operations.
> 
> May be we should follow Intel's rule: you can add new instructions only if you reduce power consumption (size in our case). ;-)
> 
>>                        
>> 2. How is important for Intel to support new vector instructions for CPU without AVX? May be we should stop new code for old CPUs such as vsll16B_reg, for exaxmple. It does not mean we can't use SSE instructions in implementation (for example, vabs8B_reg) - such cases are fine.
>>
>> Sandhya >> Definitely AVX is higher priority, we can try to merge as many rules as possible on similar lines as vabs8B_reg.
> 
> Okay.
> 
>>
>> 3. I still want to see some common instructions pattern in 'ins_encode %{ %}' be moved into macro-assembler. For example, the only difference between vs*_reg and vs*_reg_imm is one or 2 instructions, the rest is the same.
>>
>> Sandhya >> This is due to the peculiarity of Shift count handling in the superword, only when the count is not immediate it goes through RShiftCntV/LShiftCntV. It should be possible to merge these rules with some tweaks to superword.cpp. Since this was common part of the code, I didn’t want to change so as to minimize effect on other architecture. I can take a closer look and clean that up.
> 
> Okay.
> 
>>
>> 4. Most important. The main reason we have a lot AD instructions is to 'match' different vector types for corresponding different vector length. I think we should revisit this approach.
>>
>> Intel CPU does not use parts of vector registers separately - C2 does not use XMM0b, XMM0c, XMM0d parts of xmm0. Even when C2 uses VecS type it use whole zmm register in avx512 but narrowed it by passing length to assembler instruction (or we use an instruction which uses only part of 512 bit register).
>>
>> Vladimir I. suggested to have VecMAX type which can be used to match all different vector length implementation to have only one AD instruction. And use vector length to generate corresponding code. For example, vabs8B_reg() and vabs16B_reg() are almost the same except vectors type VecD vs VecX. There should be no difference in code generation (we need to modify vec_mov_helper() and other similar code to check vector length when it see VecMAX).
>>
>> We can use this approach for already existing instructions too to reduce code size generated from AD files.
>>
>> What do you think?
>>
>> Sandhya >> Using vecMAX will lead to spill/fill code using the largest
>> vector width which is not recommended on Intel
> 
> That is why I added comment about vec_mov_helper(). This function is used for generating spills. You definitely should not save whole 512 bits register but only part corresponding to byte size of vector.
> Note, when I talked about vector length I meant length_in_bytes().
> 
> architecture. The better way to club or merge then would be as John suggested with binary/unary op for same register type. We will need to take into account the temporaries needed etc while clubbing the rules so as to not degrade the generated code
> 
> I am not sure this approach is simpler in short term but I can't say what would be better in long run. This is needed to explore.
> 
> unnecessarily. Some questions that will help me is how much redesign we want? Only for new code? Or also the existing code?  Also
> 
> In long term we should update all existing code too.
> 
> is the libjvm size our criteria or ad file size or both?  I will need a lot of support from you and Vlaidmir Ivanov if we need to
> 
> The main goal is reduce libjvm size. But we should keep other platforms in mind.
> 
> get this redesign through in reviews and sponsorship over the coming month or so.  I will explore a small protoype and submit that first and we could then go from there.
> 
> Thanks,
> Vladimir
> 
>>
>> Regards,
>> Vladimir K
>>
>> On 4/9/19 10:18 AM, Viswanathan, Sandhya wrote:
>>> Hi Yang,
>>>
>>> Thanks a lot for trying out the patch in your setup.
>>>
>>> Please do let me know when you check the details if you find the failure in DivideMvTests.java to be due to this patch.
>>>
>>> I will fix all the trailing space and unaligned line style issues that you pointed out.
>>>
>>> The TestInt is updated to cover for some additional support added for "int" in this patch like Absolute and subtraction from zero.
>>> There is an additional test for Not for which we plan to add support in a follow up patch.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>>
>>> -----Original Message-----
>>> From: Yang Zhang (Arm Technology China) [mailto:Yang.Zhang at arm.com]
>>> Sent: Tuesday, April 09, 2019 1:04 AM
>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>> hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
>>>
>>> Hi Sandhya
>>>
>>> Thanks for proposing this enhancement.
>>> I have tested this patch in our internal ci. There is a new failure. But I didn't check the details.
>>> java/math/BigDecimal/DivideMcTests.java
>>>
>>> In addition,  there are trailing spaces in the following files.
>>> src/hotspot/cpu/x86/assembler_x86.cpp
>>> src/hotspot/cpu/x86/stubGenerator_x86_32.cpp
>>> src/hotspot/cpu/x86/x86.ad
>>> src/hotspot/cpu/x86/x86_32.ad
>>> src/hotspot/share/opto/superword.cpp
>>>
>>> In file src/hotspot/share/classfile/vmSymbols.hpp, there are some unaligned lines.
>>> In file test/hotspot/jtreg/compiler/c2/cr6340864/TestIntVect.java, there are new test functions. Are these new functions needed by byte/short/long?
>>>
>>> Regards,
>>> Yang
>>>
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of
>>> Viswanathan, Sandhya
>>> Sent: Saturday, April 6, 2019 9:18 AM
>>> To: hotspot-compiler-dev at openjdk.java.net; Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com>
>>> Subject: RFR (M) 8222074: Enhance auto vectorization for x86
>>>
>>>
>>> Please find below a link to the webrev which enhances super-word auto vectorization for x86.
>>> The following additional operations are supported:
>>>
>>> 1)      Absolute for all data types
>>>
>>> 2)      Shifts for byte data types
>>>
>>> 3)      Shift right arithmetic for long data type
>>>
>>> 4)      Byte multiply
>>>
>>> 5)      Negate for float/double
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8222074
>>> Webrev: http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.00/
>>>
>>> The compiler jtreg tests pass with UseAVX=0,1,2,3 and KNL.
>>> Your review and comments are welcome.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
>>>