[vectorIntrinsics+mask] RFR: 8273406: Optimize various masked vector operations for AVX512 target.

Mon Sep 13 06:14:29 UTC 2021

On Thu, 9 Sep 2021 22:52:59 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> This patch continues the X86 backend support for optimizing masked operations on AVX-512 targets (JDK-8262356).
>> 
>> Summary of changes:
>> 
>> 1) Support for masked rotate left and right operations over integer/long vectors.
>> 
>> 2) Support for masked square root operation over float/double vectors.
>> 
>> 3) Support for masked logical shift left and logical/arithmetic shift right operations with a constant shift count.
>> 
>> 4) Optimized the VectorMask.not operation by emitting a direct KNOT instruction.
>> 
>> 5) Extended masking optimization support to the X86 KNL target, which has a limited set of AVX-512 features.
>> 
>>       - Currently the vector type associated with a VectorLoadMask operation is created during the parsing stage.
>>         For targets supporting opmask registers, the lane type is explicitly set to BOOLEAN irrespective of the primitive
>>         type of the species, e.g. for the Int512 species the ideal type TypeVectMask(16,BOOL) represents a vector of
>>         16 BOOLEAN elements, each of which represents a mask bit for the corresponding vector lane.
>>         This type information is also associated with the respective mask boxes (Int512Mask).
>>     
>>       - During macro expansion, vbox/vunbox nodes are broken down into granular, target-mappable ideal nodes.
>> 
>>           ``` 
>>               VectorBoxNode   -> VectorStoreMask + StoreVector
>> 
>>               VectorUnboxNode -> LoadVector + VectorLoadMask 
>>           ```
>> 
>>         At this stage the vector type (TypeVectMask(16,BOOL)) earlier associated with the vunbox node is used to
>>         create the type for the VectorLoadMask operation.
>>     
>>       - Masks can be propagated either through a vector (non-AVX512 targets) or using opmask registers (K1-K7).
>>         The decision to create the correct ideal type based on the target features is delegated to the low-level
>>         type creation routine TypeVect::makemask.
>>     
>>       - This creates a problem for targets like KNL which support a limited set of AVX-512 features, i.e. do
>>         not support the AVX512VL and AVX512BW features.
>>     
>>       - For the Int512 species the initial ideal type constructed during parsing is based on the primitive type and
>>         lane count associated with the species, but during macro expansion the type creation
>>         decision is based on the vector type associated with v[u]box nodes, i.e. TypeVectMask(16,BOOL).
>>         Thus for the KNL target an incorrect vector mask type TypeVectX(16,BOOL) gets created, since KNL does not
>>         support vector length extension (128/256-bit operations over EVEX-encoded instructions).
>>     
>>       - There are multiple ways to fix this discrepancy. The cleanest approach is to create the ideal type TypeVectMask
>>         based on the primitive lane type of the species, instead of always setting the lane type to BOOLEAN.
>>         This also preserves the original lane type information, which was needed in some cases, e.g. the
>>         reinterpretation operation over a mask. To circumvent such issues, explicit src/dst primitive types
>>         were added to ideal nodes.
>>     
>>       - Also, this does not disturb the register mask and spilling behavior associated with opmask registers,
>>         thus the change is transparent to backend passes.
>> 	
>> Validation:
>> Patch regression-tested with tier1-3 tests at AVX Level=0,1,2,3 and with UseKNLSetting.
>
> src/hotspot/cpu/x86/x86.ad line 1970:
> 
>> 1968:     case Op_OrVMask:
>> 1969:     case Op_XorVMask:
>> 1970:       if (vlen > 16 && !VM_Version::supports_avx512bw()) {
> 
> Isn't there a limitation as well for vlen 8 and DQ support?

That is taken care of in the instruction pattern, where we pick a mask length of 16 if the target does not support DQ.
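
To illustrate why widening is safe, here is a minimal sketch in plain Java (not the HotSpot code). It models an opmask register as an int and assumes the mask bits above the vector length are zero in both inputs. The class and helper names (`MaskWidthSketch`, `opWidth`, `xorAtWidth`) are hypothetical, and the `Math.max(vlen, 8)` choice for the DQ case is my assumption about how the narrow lengths would be handled. Byte-wide opmask logicals (KANDB, KORB, KXORB) need AVX512DQ, while the word-wide forms are part of AVX512F, so without DQ the 16-bit form is the narrowest one available.

```java
final class MaskWidthSketch {
    // Hypothetical helper: width in mask bits of the k-register instruction
    // that would be used for a logical mask operation on a vlen-lane mask.
    static int opWidth(int vlen, boolean hasDQ) {
        // Without AVX512DQ there is no byte-wide form, so use the 16-bit one.
        return (vlen < 16 && !hasDQ) ? 16 : Math.max(vlen, 8);
    }

    // Models KXOR at the given width (8 or 16 bits here).
    static int xorAtWidth(int a, int b, int width) {
        int widthMask = (1 << width) - 1;
        return (a ^ b) & widthMask;
    }

    public static void main(String[] args) {
        // For every pair of 8-lane masks with zero upper bits, the 16-bit
        // operation gives the same result as the 8-bit one.
        for (int a = 0; a < 256; a++) {
            for (int b = 0; b < 256; b++) {
                if (xorAtWidth(a, b, 8) != xorAtWidth(a, b, 16)) {
                    throw new AssertionError("mismatch for " + a + ", " + b);
                }
            }
        }
        System.out.println("width without DQ: " + opWidth(8, false));
    }
}
```

The same argument holds for AND and OR, since zero upper bits stay zero under all three operations. It does not hold for NOT, which is why the NOT case needs the extra filtering discussed further down.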

> src/hotspot/cpu/x86/x86.ad line 9360:
> 
>> 9358: %}
>> 9359: 
>> 9360: instruct mask_not_immLT8(kReg dst, kReg src, rRegI rtmp, kReg ktmp, immI_M1 cnt) %{
> 
> What happens to not operation if vector_length < 8 and it is a !avx512dq platform?

In that case this pattern will not be matched, since the instruction sequence would still be the same.

The following two implementations are possible for targets without DQ:
A) XorVMask SRC (MaskAll -1):
  Instruction sequence:
  KMOV -1 SRC1
  KSHIFTRL 16-masklen, SRC1
  KXOR SRC1, DST
B) New instruction for the NOT operation:
  Instruction sequence:
  KNOTW SRC1
  KMOV  FILTER 3/15
  KAND SRC FILTER

There is apparently not much difference between the two implementations.
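
As a rough check of that claim, the two sequences can be modeled in plain Java on a 16-bit value. This is a sketch under stated assumptions, not the generated code: the class and method names are hypothetical, the opmask register is modeled as an int, and the source mask is assumed to have no bits set above masklen. FILTER is taken to be the low masklen bits (3 for 2 lanes, 15 for 4 lanes).

```java
final class MaskNotSketch {
    static final int K16 = 0xFFFF; // models a 16-bit opmask register

    // Sequence A: XOR with an all-ones mask shifted down to masklen bits.
    static int notViaXor(int src, int maskLen) {
        int allOnes = K16;                         // KMOV -1
        int laneBits = allOnes >>> (16 - maskLen); // KSHIFTR by 16-masklen
        return (src ^ laneBits) & K16;             // KXOR
    }

    // Sequence B: 16-bit NOT followed by an AND with a lane filter.
    static int notViaKnot(int src, int maskLen) {
        int inverted = ~src & K16;                 // KNOTW
        int filter = (1 << maskLen) - 1;           // KMOV FILTER (3 or 15)
        return inverted & filter;                  // KAND
    }

    public static void main(String[] args) {
        // Compare both sequences for every 2-lane and 4-lane source mask.
        for (int maskLen : new int[] {2, 4}) {
            for (int src = 0; src < (1 << maskLen); src++) {
                if (notViaXor(src, maskLen) != notViaKnot(src, maskLen)) {
                    throw new AssertionError("mismatch for src=" + src);
                }
            }
        }
        System.out.println("both sequences agree");
    }
}
```

Both are three-instruction sequences and, under the assumption above, produce the same mask. They differ only if the source has stray bits above masklen: sequence A passes them through, sequence B clears them.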

> src/hotspot/cpu/x86/x86.ad line 9375:
> 
>> 9373:   predicate((Matcher::vector_length(n) == 8 && VM_Version::supports_avx512dq()) ||
>> 9374:             (Matcher::vector_length(n) == 16) ||
>> 9375:             (Matcher::vector_length(n) > 16 && VM_Version::supports_avx512bw()));
> 
> What happens to not operation if vector_length == 8 and it is a !avx512dq platform?

Same as above

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/122