RFR (M) 8222074: Enhance auto vectorization for x86

Tue Apr 9 23:59:16 UTC 2019

Hi Vladimir,

Please see my answers in your email below.

Best Regards,
Sandhya

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] 
Sent: Tuesday, April 09, 2019 12:34 PM
To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86

Hi Sandhya,

I looked through changes and had discussion with Vladimir Ivanov about them.
In general logic of changes follow out usual pattern - no problem here.

There are cases where we can use less `TEMP tmp` registers by using 'dst' register like in mul4B_reg(). Is it intentional to not use 'dst' there?

But my main concern now is JVM size significant grow. And it will be worse when we implementing the rest of Vector API instructions.

The main reason for size grow is additional AD instructions which are compiled by APLC into multiply functions. I originally thought that moving 'ins_encode %{ %}' code which have several instructions into macro-assembler will help if it is used by several AD instructions. But Vladimir I. convinced me that it will be insignificant comparing to reducing number of AD instructions.

We came up with several suggestion how we can address it and it will greatly help if you (Intel) investigate them.

1. I think you should provide JVM size increase data for changes like this. What is increase for this one?

Sandhya >> The size increase is about 1.5%:
                      libjvm.so                            : 24358297
                      libjvm.so + 8222074        : 24718043 (1.5%)

2. How is important for Intel to support new vector instructions for CPU without AVX? May be we should stop new code for old CPUs such as vsll16B_reg, for exaxmple. It does not mean we can't use SSE instructions in implementation (for example, vabs8B_reg) - such cases are fine.

Sandhya >> Definitely AVX is higher priority, we can try to merge as many rules as possible on similar lines as vabs8B_reg.

3. I still want to see some common instructions pattern in 'ins_encode %{ %}' be moved into macro-assembler. For example, the only difference between vs*_reg and vs*_reg_imm is one or 2 instructions, the rest is the same.

Sandhya >> This is due to the peculiarity of Shift count handling in the superword, only when the count is not immediate it goes through RShiftCntV/LShiftCntV. It should be possible to merge these rules with some tweaks to superword.cpp. Since this was common part of the code, I didn’t want to change so as to minimize effect on other architecture. I can take a closer look and clean that up.

4. Most important. The main reason we have a lot AD instructions is to 'match' different vector types for corresponding different vector length. I think we should revisit this approach.

Intel CPU does not use parts of vector registers separately - C2 does not use XMM0b, XMM0c, XMM0d parts of xmm0. Even when C2 uses VecS type it use whole zmm register in avx512 but narrowed it by passing length to assembler instruction (or we use an instruction which uses only part of 512 bit register).

Vladimir I. suggested to have VecMAX type which can be used to match all different vector length implementation to have only one AD instruction. And use vector length to generate corresponding code. For example, vabs8B_reg() and vabs16B_reg() are almost the same except vectors type VecD vs VecX. There should be no difference in code generation (we need to modify vec_mov_helper() and other similar code to check vector length when it see VecMAX).

We can use this approach for already existing instructions too to reduce code size generated from AD files.

What do you think?

Sandhya >> Using vecMAX will lead to spill/fill code using the largest vector width which is not recommended on Intel architecture. The better way to club or merge then would be as John suggested with binary/unary op for same register type. We will need to take into account the temporaries needed etc while clubbing the rules so as to not degrade the generated code unnecessarily. Some questions that will help me is how much redesign we want? Only for new code? Or also the existing code?  Also is the libjvm size our criteria or ad file size or both?  I will need a lot of support from you and Vlaidmir Ivanov if we need to get this redesign through in reviews and sponsorship over the coming month or so.  I will explore a small protoype and submit that first and we could then go from there.

Regards,
Vladimir K

On 4/9/19 10:18 AM, Viswanathan, Sandhya wrote:
> Hi Yang,
> 
> Thanks a lot for trying out the patch in your setup.
> 
> Please do let me know when you check the details if you find the failure in DivideMvTests.java to be due to this patch.
> 
> I will fix all the trailing space and unaligned line style issues that you pointed out.
> 
> The TestInt is updated to cover for some additional support added for "int" in this patch like Absolute and subtraction from zero.
> There is an additional test for Not for which we plan to add support in a follow up patch.
> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: Yang Zhang (Arm Technology China) [mailto:Yang.Zhang at arm.com]
> Sent: Tuesday, April 09, 2019 1:04 AM
> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; 
> hotspot-compiler-dev at openjdk.java.net
> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> Hi Sandhya
> 
> Thanks for proposing this enhancement.
> I have tested this patch in our internal ci. There is a new failure. But I didn't check the details.
> java/math/BigDecimal/DivideMcTests.java
> 
> In addition,  there are trailing spaces in the following files.
> src/hotspot/cpu/x86/assembler_x86.cpp
> src/hotspot/cpu/x86/stubGenerator_x86_32.cpp
> src/hotspot/cpu/x86/x86.ad
> src/hotspot/cpu/x86/x86_32.ad
> src/hotspot/share/opto/superword.cpp
> 
> In file src/hotspot/share/classfile/vmSymbols.hpp, there are some unaligned lines.
> In file test/hotspot/jtreg/compiler/c2/cr6340864/TestIntVect.java, there are new test functions. Are these new functions needed by byte/short/long?
> 
> Regards,
> Yang
> 
> 
> -----Original Message-----
> From: hotspot-compiler-dev 
> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of 
> Viswanathan, Sandhya
> Sent: Saturday, April 6, 2019 9:18 AM
> To: hotspot-compiler-dev at openjdk.java.net; Vladimir Kozlov 
> <vladimir.kozlov at oracle.com>
> Subject: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> 
> Please find below a link to the webrev which enhances super-word auto vectorization for x86.
> The following additional operations are supported:
> 
> 1)      Absolute for all data types
> 
> 2)      Shifts for byte data types
> 
> 3)      Shift right arithmetic for long data type
> 
> 4)      Byte multiply
> 
> 5)      Negate for float/double
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8222074
> Webrev: http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.00/
> 
> The compiler jtreg tests pass with UseAVX=0,1,2,3 and KNL.
> Your review and comments are welcome.
> 
> Best Regards,
> Sandhya
> 
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
>