RFR (M) 8222074: Enhance auto vectorization for x86

Tue Apr 9 19:33:51 UTC 2019

Hi Sandhya,

I looked through the changes and discussed them with Vladimir Ivanov.
In general, the logic of the changes follows our usual pattern, so no problem there.

There are cases where we could use fewer `TEMP tmp` registers by using the 'dst' register, for 
example in mul4B_reg(). Is it intentional not to use 'dst' there?

But my main concern now is the significant growth in JVM size. And it will get worse when we 
implement the rest of the Vector API instructions.

The main reason for the size growth is the additional AD instructions, which ADLC compiles into 
multiple functions. I originally thought that moving 'ins_encode %{ %}' code which has several 
instructions into the macro-assembler would help if it is used by several AD instructions. But 
Vladimir I. convinced me that the gain would be insignificant compared to reducing the number of 
AD instructions.

We came up with several suggestions for how to address this, and it would greatly help if you 
(Intel) investigated them.

1. I think you should provide JVM size increase data for changes like this. What is the increase 
for this one?

2. How important is it for Intel to support new vector instructions on CPUs without AVX? Maybe we 
should stop adding new code for old CPUs, such as vsll16B_reg, for example. That does not mean we 
can't use SSE instructions in an implementation (for example, vabs8B_reg) - such cases are fine.

3. I still want to see some common instruction patterns in 'ins_encode %{ %}' moved into the 
macro-assembler. For example, the only difference between vs*_reg and vs*_reg_imm is one or two 
instructions; the rest is the same.

4. Most important. The main reason we have a lot of AD instructions is to 'match' different vector 
types for the corresponding vector lengths. I think we should revisit this approach.

Intel CPUs do not use parts of vector registers separately - C2 does not use the XMM0b, XMM0c, XMM0d 
parts of xmm0. Even when C2 uses the VecS type, it uses the whole zmm register with AVX-512 but 
narrows it by passing the length to the assembler instruction (or we use an instruction which uses 
only part of the 512-bit register).

Vladimir I. suggested having a VecMAX type which can be used to match all the different vector 
length implementations, so that there is only one AD instruction, and using the vector length to 
generate the corresponding code. For example, vabs8B_reg() and vabs16B_reg() are almost the same 
except for the vector type, VecD vs VecX. There should be no difference in code generation (we need 
to modify vec_mov_helper() and other similar code to check the vector length when it sees VecMAX).
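
To make the idea concrete, here is a minimal sketch in Java. It is a toy model, not HotSpot or ADLC 
code: the class VecMaxSketch and the methods encodingFor() and emitVabsB() are hypothetical names 
invented for illustration. It only shows the shape of the proposal: one rule handles every vector 
width and derives the encoding from the operand length in bytes (4, 8 and 16 bytes correspond to 
VecS, VecD and VecX; 32 and 64 to VecY and VecZ), instead of one rule per vector type.

```java
// Hypothetical model, not HotSpot code: a single "rule" that covers every
// vector width and picks the encoding from the operand length in bytes.
public class VecMaxSketch {
    // Illustrative stand-in for the 128/256/512-bit vector length encodings.
    public enum VectorLen { AVX_128bit, AVX_256bit, AVX_512bit }

    // Map a vector length in bytes to the encoding the assembler would get.
    // VecS (4), VecD (8) and VecX (16) all fit in a 128-bit operation.
    public static VectorLen encodingFor(int lengthInBytes) {
        switch (lengthInBytes) {
            case 4:
            case 8:
            case 16:
                return VectorLen.AVX_128bit;
            case 32:
                return VectorLen.AVX_256bit;
            case 64:
                return VectorLen.AVX_512bit;
            default:
                throw new IllegalArgumentException("unsupported vector length: " + lengthInBytes);
        }
    }

    // One emitter for all widths; this replaces the per-width copies
    // (an 8-byte rule, a 16-byte rule, and so on) in this toy model.
    public static String emitVabsB(int lengthInBytes) {
        return "vpabsb dst, src, " + encodingFor(lengthInBytes);
    }

    public static void main(String[] args) {
        for (int len : new int[] {8, 16, 32, 64}) {
            System.out.println(len + " bytes -> " + emitVabsB(len));
        }
    }
}
```

In this model the 8-byte and 16-byte cases produce identical output, which is the point made above 
about vabs8B_reg() and vabs16B_reg(): only the matched type differs, not the generated code.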

We can use this approach for already existing instructions too, to reduce the size of the code 
generated from AD files.

What do you think?

Regards,
Vladimir K

On 4/9/19 10:18 AM, Viswanathan, Sandhya wrote:
> Hi Yang,
> 
> Thanks a lot for trying out the patch in your setup.
> 
> Please do let me know, once you have checked the details, whether the failure in DivideMcTests.java is due to this patch.
> 
> I will fix all the trailing space and unaligned line style issues that you pointed out.
> 
> The TestInt is updated to cover for some additional support added for "int" in this patch like Absolute and subtraction from zero.
> There is an additional test for Not for which we plan to add support in a follow up patch.
> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: Yang Zhang (Arm Technology China) [mailto:Yang.Zhang at arm.com]
> Sent: Tuesday, April 09, 2019 1:04 AM
> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> Hi Sandhya
> 
> Thanks for proposing this enhancement.
> I have tested this patch in our internal CI. There is a new failure, but I haven't checked the details yet.
> java/math/BigDecimal/DivideMcTests.java
> 
> In addition, there are trailing spaces in the following files.
> src/hotspot/cpu/x86/assembler_x86.cpp
> src/hotspot/cpu/x86/stubGenerator_x86_32.cpp
> src/hotspot/cpu/x86/x86.ad
> src/hotspot/cpu/x86/x86_32.ad
> src/hotspot/share/opto/superword.cpp
> 
> In file src/hotspot/share/classfile/vmSymbols.hpp, there are some unaligned lines.
> In file test/hotspot/jtreg/compiler/c2/cr6340864/TestIntVect.java, there are new test functions. Are these new functions needed by byte/short/long?
> 
> Regards,
> Yang
> 
> 
> -----Original Message-----
> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Viswanathan, Sandhya
> Sent: Saturday, April 6, 2019 9:18 AM
> To: hotspot-compiler-dev at openjdk.java.net; Vladimir Kozlov <vladimir.kozlov at oracle.com>
> Subject: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> 
> Please find below a link to the webrev which enhances super-word auto vectorization for x86.
> The following additional operations are supported:
> 
> 1)      Absolute for all data types
> 
> 2)      Shifts for byte data types
> 
> 3)      Shift right arithmetic for long data type
> 
> 4)      Byte multiply
> 
> 5)      Negate for float/double
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8222074
> Webrev: http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.00/
> 
> The compiler jtreg tests pass with UseAVX=0,1,2,3 and KNL.
> Your review and comments are welcome.
> 
> Best Regards,
> Sandhya
> 