[vectorIntrinsics+mask] RFR: 8273406: Optimize various masked vector operations for AVX512 target.

Thu Sep 9 22:58:19 UTC 2021

On Tue, 7 Sep 2021 14:53:25 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> This patch continues the X86 backend support for optimizing masked operations on AVX-512 targets (JDK-8262356).
> 
> Summary of changes:
> 
> 1) Support for masked rotate left and right operations over integer/long vectors.
> 
> 2) Support for masked square root operation over float/double vectors.
> 
> 3) Support for masked logical shift-left and logical/arithmetic shift-right operations with a constant shift count.
> 
> 4) Optimized the VectorMask.not operation by emitting a direct KNOT instruction.
> 
> 5) Extended masking optimization support to the X86 KNL target, which has a limited set of AVX-512 features.
> 
>       - Currently, the vector type associated with a VectorLoadMask operation is created during the parsing stage.
>         For targets supporting opmask registers, the lane type is explicitly set to BOOLEAN irrespective of the primitive
>         type of the species, i.e. for the Int512 species the ideal type TypeVectMask(16,BOOL) represents a vector of 16 BOOLEAN
>         elements, each of which represents the mask bit for the corresponding vector lane.
>         This type information is also associated with the respective mask boxes (Int512Mask).
>     
>       - During macro expansion, vbox/vunbox nodes are broken down into granular, target-mappable ideal nodes.
> 
>           ``` 
>               VectorBoxNode   -> VectorStoreMask + StoreVector
> 
>               VectorUnboxNode -> LoadVector + VectorLoadMask 
>           ```
> 
>         At this stage, the vector type (TypeVectMask(16,BOOL)) earlier associated with the vunbox node is used to create
>         the type for the VectorLoadMask operation.
>     
>       - Masks can be propagated either through a vector (non-AVX512 targets) or using opmask registers (K1-K7).
>         The decision to create the correct ideal type based on the target features is delegated to the low-level
>         type creation routine TypeVect::makemask.
>     
>       - This creates a problem for targets like KNL, which support a limited set of AVX-512 features, i.e. do
>         not support the AVX512VL and AVX512BW features.
>     
>       - For the Int512 species, the initial ideal type constructed during parsing is based on the primitive type and
>         lane count associated with the species, but during macro expansion the type creation
>         decision is based on the vector type associated with the v[u]box nodes, i.e. TypeVectMask(16,BOOL).
>         Thus, for the KNL target an incorrect vector mask type TypeVectX(16,BOOL) gets created, since KNL does not
>         support the vector length extension (128/256-bit operations with EVEX-encoded instructions).
>     
>       - There are multiple ways to fix this discrepancy; the cleanest approach is to create the ideal type TypeVectMask
>         based on the primitive lane type of the species, instead of always setting the lane type to BOOLEAN.
>         This also preserves the original lane type information, which was needed in some cases, e.g. a
>         reinterpretation operation over a mask. To circumvent such issues, explicit src/dst primitive types
>         were added to the ideal nodes.
>     
>       - Also, this does not disturb the register mask and spilling behavior associated with opmask registers,
>         so the change is transparent to the backend passes.
> 	
> Validation:
> Patch was regression-tested with tier1-3 tests at AVX Level=0,1,2,3 and with UseKNLSetting.
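
To make the KNL discrepancy in item 5 concrete, here is a minimal standalone sketch. It is not HotSpot code: `Target`, `Lane` and `uses_opmask` are hypothetical names, and the model only encodes the vector-length rule stated above (a non-512-bit EVEX operation needs AVX512VL).

```cpp
#include <cassert>

// Hypothetical, simplified view of the target features relevant here.
struct Target {
  bool avx512f;
  bool avx512vl;
};

// Lane kinds used in this sketch; the value is the lane size in bytes.
enum class Lane : int { BOOLEAN = 1, INT = 4, LONG = 8 };

// Assumed model of the decision: an opmask type is usable only if the
// target can run EVEX instructions on a vector of lanes * lane_size bytes.
// Without AVX512VL only the full 64-byte (512-bit) width qualifies.
bool uses_opmask(const Target& t, Lane lane, int lanes) {
  if (!t.avx512f) {
    return false;
  }
  int bytes = static_cast<int>(lane) * lanes;
  return bytes == 64 || t.avx512vl;
}
```

With a KNL-like target (`avx512f` set, `avx512vl` clear), 16 BOOLEAN lanes make a 16-byte vector and the model falls back to a plain vector type, while 16 INT lanes make a 64-byte vector and keep the opmask type. That is the effect of deriving the type from the species' primitive lane type.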

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3947:

> 3945:       evrold(eType, dst, mask, src1, src2, merge, vlen_enc); break;
> 3946:     case Op_RotateRightV:
> 3947:       evrord(eType, dst, mask, src1, src2, merge, vlen_enc); break;

is_varshift is not passed to evrold/evrord here, but x86.ad is sending is_varshift for rotates.
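
For reference, a scalar model of what I would expect the flag to select for a merge-masked rotate. This is only a sketch of the semantics under my assumptions; `masked_rotl`, `rotl32` and the 16-lane layout are hypothetical and not the patch's API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;
using IntVec = std::array<uint32_t, kLanes>;

// Rotate a 32-bit value left by n (taken modulo 32).
inline uint32_t rotl32(uint32_t x, uint32_t n) {
  n &= 31;
  return n == 0 ? x : (x << n) | (x >> (32 - n));
}

// Merge-masked rotate-left over 16 int lanes.
// Lanes whose mask bit is clear keep the value from dst (merge semantics).
// If is_varshift is true each lane uses its own count from counts,
// otherwise every lane uses the uniform count counts[0].
IntVec masked_rotl(const IntVec& dst, const IntVec& counts,
                   uint16_t mask, bool is_varshift) {
  IntVec result = dst;
  for (int i = 0; i < kLanes; i++) {
    if (mask & (1u << i)) {
      uint32_t n = is_varshift ? counts[i] : counts[0];
      result[i] = rotl32(dst[i], n);
    }
  }
  return result;
}
```

If the flag is dropped on the way to the assembler, the two cases above cannot be told apart at emission time, which is the concern here.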

src/hotspot/cpu/x86/x86.ad line 1970:

> 1968:     case Op_OrVMask:
> 1969:     case Op_XorVMask:
> 1970:       if (vlen > 16 && !VM_Version::supports_avx512bw()) {

Isn't there a limitation as well for vlen 8 and DQ support?
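
The limitation I have in mind, as a small standalone sketch: the byte-width K-register logical instructions (KANDB/KORB/KXORB/KNOTB) are AVX512DQ, the word-width ones are AVX512F, and the dword/qword ones are AVX512BW. The enum and function names below are hypothetical.

```cpp
#include <cassert>

enum class Feature { AVX512F, AVX512DQ, AVX512BW };

// Feature required by the K-register logical instruction whose operand
// width matches the mask length exactly (no widening applied).
Feature feature_for_mask_logic(int masklen) {
  if (masklen <= 8) {
    return Feature::AVX512DQ;  // byte forms, e.g. KANDB
  }
  if (masklen <= 16) {
    return Feature::AVX512F;   // word forms, e.g. KANDW
  }
  return Feature::AVX512BW;    // dword/qword forms, e.g. KANDD, KANDQ
}
```

So a check on `vlen > 16` alone does not cover the byte-width case unless the operation is widened to the word form elsewhere.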

src/hotspot/cpu/x86/x86.ad line 7201:

> 7199: 
> 7200: instruct evcmpFD64(vec dst, vec src1, vec src2, immI8 cond, rRegP scratch, kReg ktmp) %{
> 7201:   predicate(!VM_Version::supports_avx512vl() &&

Why check for !avx512vl here in evcmpFD64 when the avx512vl check was removed from evcmpFD?
Likewise for vcmpu64.

src/hotspot/cpu/x86/x86.ad line 7336:

> 7334: instruct evcmp(kReg dst, vec src1, vec src2, immI8 cond) %{
> 7335:   predicate(UseAVX > 2 &&
> 7336:             n->bottom_type()->isa_vectmask() && // src1

Why does the comment say src1?

src/hotspot/cpu/x86/x86.ad line 8991:

> 8989:   match(Set dst (RotateLeftV (Binary dst src2) mask));
> 8990:   match(Set dst (RotateRightV (Binary dst src2) mask));
> 8991:   format %{ "vrotate_masked $dst, $dst, $src2\t! rotate masked operation" %}

No mask register shown in format.

src/hotspot/cpu/x86/x86.ad line 8998:

> 8996:     bool is_varshift = !VectorNode::is_vshift_cnt_opcode(in(2)->isa_Mach()->ideal_Opcode());
> 8997:     __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
> 8998:                    $dst$$XMMRegister, $src2$$XMMRegister, true, vlen_enc, is_varshift);

x86.ad is sending is_varshift for rotates but is_varshift is not passed to evrold/evrord in evmasked_op.

src/hotspot/cpu/x86/x86.ad line 9007:

> 9005:   match(Set dst (LShiftVI (Binary dst (LShiftCntV shift)) mask));
> 9006:   match(Set dst (LShiftVL (Binary dst (LShiftCntV shift)) mask));
> 9007:   format %{ "vplshift_imm_masked $dst, $dst, $shift\t! lshift masked operation" %}

No mask register shown in format.

src/hotspot/cpu/x86/x86.ad line 9053:

> 9051:   match(Set dst (RShiftVI (Binary dst (RShiftCntV shift)) mask));
> 9052:   match(Set dst (RShiftVL (Binary dst (RShiftCntV shift)) mask));
> 9053:   format %{ "vprshift_imm_masked $dst, $dst, $shift\t! rshift masked operation" %}

No mask register shown in format.

src/hotspot/cpu/x86/x86.ad line 9099:

> 9097:   match(Set dst (URShiftVI (Binary dst (RShiftCntV shift)) mask));
> 9098:   match(Set dst (URShiftVL (Binary dst (RShiftCntV shift)) mask));
> 9099:   format %{ "vpurshift_imm_masked $dst, $dst, $shift\t! urshift masked operation" %}

No mask register shown in format.

src/hotspot/cpu/x86/x86.ad line 9306:

> 9304:   match(Set dst (MaskAll cnt));
> 9305:   effect(TEMP_DEF dst, TEMP tmp);
> 9306:   format %{ "mask_all_evexI $dst, $cnt \t! mask all operation" %}

No need to say TEMP_DEF for dst. Also good to show tmp register in format.

src/hotspot/cpu/x86/x86.ad line 9326:

> 9324:   match(Set dst (MaskAll src));
> 9325:   effect(TEMP_DEF dst, TEMP tmp);
> 9326:   format %{ "mask_all_evexI $dst, $src \t! mask all operation" %}

No need to say TEMP_DEF for dst. Also good to show tmp register in format.

src/hotspot/cpu/x86/x86.ad line 9360:

> 9358: %}
> 9359: 
> 9360: instruct mask_not_immLT8(kReg dst, kReg src, rRegI rtmp, kReg ktmp, immI_M1 cnt) %{

What happens to not operation if vector_length < 8 and it is a !avx512dq platform?

src/hotspot/cpu/x86/x86.ad line 9364:

> 9362:   match(Set dst (XorVMask src (MaskAll cnt)));
> 9363:   effect(TEMP_DEF dst, TEMP rtmp, TEMP ktmp);
> 9364:   format %{ "mask_not_LT8 $dst, $src, $cnt \t! mask not operation" %}

Good to show temp registers in format, helps in debugging.
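
As background for the sub-8-lane case, a bit-level sketch of why a plain byte-wide NOT is not enough there and the rule matches an XOR with MaskAll instead. The helper names are hypothetical, and mask registers are modeled as plain 64-bit integers.

```cpp
#include <cassert>
#include <cstdint>

// All-ones mask covering the low vlen lanes.
uint64_t mask_all(int vlen) {
  return vlen >= 64 ? ~0ULL : ((1ULL << vlen) - 1);
}

// Mask NOT expressed as XOR with all-ones over vlen lanes.
// Bits at and above vlen are left as they were in src.
uint64_t mask_not(uint64_t src, int vlen) {
  return src ^ mask_all(vlen);
}

// Model of a byte-wide NOT: flips all 8 low bits regardless of vlen.
uint64_t knotb_model(uint64_t src) {
  return ~src & 0xFFULL;
}
```

For a 4-lane mask 0x5, `mask_not` gives 0xA while the byte-wide model gives 0xFA, i.e. the four bits above the lane count end up set.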

src/hotspot/cpu/x86/x86.ad line 9375:

> 9373:   predicate((Matcher::vector_length(n) == 8 && VM_Version::supports_avx512dq()) ||
> 9374:             (Matcher::vector_length(n) == 16) ||
> 9375:             (Matcher::vector_length(n) > 16 && VM_Version::supports_avx512bw()));

What happens to not operation if vector_length == 8 and it is a !avx512dq platform?

src/hotspot/cpu/x86/x86.ad line 9397:

> 9395:     assert(0 == Type::cmp(mask1->bottom_type(), mask2->bottom_type()), "");
> 9396:     uint masklen = Matcher::vector_length(this);
> 9397:     masklen = masklen < 16 && !VM_Version::supports_avx512dq() ? 16 : masklen;

Good to add parentheses here: 
(masklen < 16 && !VM_Version::supports_avx512dq())
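
A standalone sketch of the parenthesized form, plus a check that widening is harmless for a bitwise AND. `effective_masklen` and `kand_at_width` are hypothetical helpers that model mask registers as integers. The conditional operator already binds more loosely than `&&`, so the parentheses do not change the result; they only make the grouping obvious.

```cpp
#include <cassert>
#include <cstdint>

// Widen short masks to the 16-bit form when the byte forms (AVX512DQ)
// are unavailable; otherwise keep the mask length as is.
unsigned effective_masklen(unsigned masklen, bool supports_avx512dq) {
  return (masklen < 16 && !supports_avx512dq) ? 16 : masklen;
}

// Bitwise AND of two masks evaluated at a given operand width in bits.
uint64_t kand_at_width(uint64_t a, uint64_t b, unsigned width) {
  uint64_t keep = width >= 64 ? ~0ULL : ((1ULL << width) - 1);
  return (a & b) & keep;
}
```

For masks that only have bits set within the original lane count, evaluating at width 8 or width 16 yields the same value, which is why the widening is safe for and/or/xor.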

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/122