[vectorIntrinsics+mask] RFR: 8273406: Optimize various masked vector operations for AVX512 target.
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Thu Sep 9 22:58:19 UTC 2021
On Tue, 7 Sep 2021 14:53:25 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
> This patch continues the X86 backend support for optimizing masked operations on AVX-512 targets (JDK-8262356).
>
> Summary of changes:
>
> 1) Support for masked rotate left and right operations over integer/long vectors.
>
> 2) Support for masked square root operation over float/double vectors.
>
> 3) Support for masked logical shift-left and logical/arithmetic shift-right operations with a constant shift count.
>
> 4) Optimized the VectorMask.not operation by emitting a direct KNOT instruction.
>
> 5) Extended masking optimization support to the X86 KNL target, which has a limited set of AVX-512 features.
>
> - Currently the vector type associated with a VectorLoadMask operation is created during the parsing stage.
> For targets supporting opmask registers, the lane type is explicitly set to BOOLEAN irrespective of the primitive
> type of the species, i.e. for the Int512 species the ideal type TypeVectMask(16,BOOL) represents a vector of 16 BOOLEAN elements,
> each of which represents a mask bit for the corresponding vector lane.
> This type information is also associated with the respective mask boxes (Int512Mask).
>
> - During macro expansion, vbox/vunbox nodes are broken down into granular, target-mappable ideal nodes.
>
> ```
> VectorBoxNode -> VectorStoreMask + StoreVector
>
> VectorUnboxNode -> LoadVector + VectorLoadMask
> ```
>
> At this stage the vector type (TypeVectMask(16,BOOL)) earlier associated with the vunbox node is used to create the
> type for the VectorLoadMask operation.
>
> - Masks can be propagated either through a vector (non-AVX512 targets) or using opmask registers (K1-K7).
> The decision to create the correct ideal type based on the target features is delegated to the low-level
> type creation routine TypeVect::makemask.
>
> - This creates a problem for targets like KNL, which support a limited set of AVX-512 features, i.e. do
> not support the AVX512VL and AVX512BW features.
>
> - For the Int512 species, the initial ideal type constructed during parsing is based on the primitive type and
> lane count associated with the species, but during macro expansion the type creation
> decision is based on the vector type associated with the v[u]box nodes, i.e. TypeVectMask(16,BOOL).
> Thus for the KNL target an incorrect vector mask type TypeVectX(16,BOOL) gets created, since it does not
> support the vector length extension (128/256-bit operations over EVEX-encoded instructions).
>
> - There are multiple ways to fix this discrepancy; the cleanest approach is to create the ideal type TypeVectMask
> based on the primitive lane type of the species, instead of always setting the lane type to BOOLEAN.
> This also preserves the original lane type information, which was needed in some cases, e.g. for a
> reinterpretation operation over a mask. To circumvent such issues, explicit src/dst primitive types
> were added to ideal nodes.
>
> - Also, this does not disturb the register mask and spilling behavior associated with opmask registers,
> so the change is transparent to backend passes.
>
> Validation:
> Patch regression tested through tier1-3 tests at AVX levels 0, 1, 2, 3 and with UseKNLSetting.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3947:
> 3945: evrold(eType, dst, mask, src1, src2, merge, vlen_enc); break;
> 3946: case Op_RotateRightV:
> 3947: evrord(eType, dst, mask, src1, src2, merge, vlen_enc); break;
is_varshift is not passed to evrold/evrord here. But x86.ad is sending is_varshift for rotates.
src/hotspot/cpu/x86/x86.ad line 1970:
> 1968: case Op_OrVMask:
> 1969: case Op_XorVMask:
> 1970: if (vlen > 16 && !VM_Version::supports_avx512bw()) {
Isn't there a limitation as well for vlen 8 and DQ support?
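For context on the DQ question, a minimal sketch of which AVX-512 feature the opmask logical instructions (KAND/KOR/KXOR/KNOT) need at each width, per the Intel SDM: the byte forms need AVX512DQ, the word forms need only AVX512F, and the dword/qword forms need AVX512BW. The enum and function names are hypothetical and not part of HotSpot.

```cpp
// Hypothetical names for illustration only; not HotSpot code.
enum class Avx512Feature { F, DQ, BW };

// Feature required by the opmask logical instruction whose width
// matches a mask of `mask_bits` bits:
//   <= 8  bits -> K*B forms (AVX512DQ)
//   <= 16 bits -> K*W forms (AVX512F)
//   >  16 bits -> K*D / K*Q forms (AVX512BW)
constexpr Avx512Feature required_feature(unsigned mask_bits) {
  return mask_bits <= 8  ? Avx512Feature::DQ
       : mask_bits <= 16 ? Avx512Feature::F
                         : Avx512Feature::BW;
}
```

So a check that only guards vlen > 16 with AVX512BW leaves the 8-lane case dependent on DQ, unless the 8-bit mask is handled with the 16-bit instruction form.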
src/hotspot/cpu/x86/x86.ad line 7201:
> 7199:
> 7200: instruct evcmpFD64(vec dst, vec src1, vec src2, immI8 cond, rRegP scratch, kReg ktmp) %{
> 7201: predicate(!VM_Version::supports_avx512vl() &&
Why check for !avx512vl here in evcmpFD64 when the avx512vl check was removed from evcmpFD?
Likewise for vcmpu64.
src/hotspot/cpu/x86/x86.ad line 7336:
> 7334: instruct evcmp(kReg dst, vec src1, vec src2, immI8 cond) %{
> 7335: predicate(UseAVX > 2 &&
> 7336: n->bottom_type()->isa_vectmask() && // src1
Why does the comment say src1?
src/hotspot/cpu/x86/x86.ad line 8991:
> 8989: match(Set dst (RotateLeftV (Binary dst src2) mask));
> 8990: match(Set dst (RotateRightV (Binary dst src2) mask));
> 8991: format %{ "vrotate_masked $dst, $dst, $src2\t! rotate masked operation" %}
No mask register shown in format.
src/hotspot/cpu/x86/x86.ad line 8998:
> 8996: bool is_varshift = !VectorNode::is_vshift_cnt_opcode(in(2)->isa_Mach()->ideal_Opcode());
> 8997: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
> 8998: $dst$$XMMRegister, $src2$$XMMRegister, true, vlen_enc, is_varshift);
x86.ad is sending is_varshift for rotates but is_varshift is not passed to evrold/evrord in evmasked_op.
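To make the intended semantics of the masked rotate rule concrete, here is a scalar model of one 32-bit lane under merge masking (the `true` argument in the call above), where dst is both the first source and the destination. This is only a sketch: the function names are hypothetical, and it does not model how evmasked_op dispatches to evrold/evrord.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by cnt, with cnt taken modulo 32.
constexpr std::uint32_t rotl32(std::uint32_t x, unsigned cnt) {
  return (x << (cnt & 31)) | (x >> ((32 - (cnt & 31)) & 31));
}

// Hypothetical scalar model of one lane of a merge-masked rotate left:
// a lane whose mask bit is set is rotated, any other lane keeps its
// previous dst value.
constexpr std::uint32_t masked_rotl_lane(std::uint32_t dst, unsigned cnt,
                                         bool mask_bit) {
  return mask_bit ? rotl32(dst, cnt) : dst;
}
```

In the variable-shift form each lane would take its own count from the corresponding lane of src2, while in the constant form every lane shares one count; that distinction is what is_varshift is meant to carry.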
src/hotspot/cpu/x86/x86.ad line 9007:
> 9005: match(Set dst (LShiftVI (Binary dst (LShiftCntV shift)) mask));
> 9006: match(Set dst (LShiftVL (Binary dst (LShiftCntV shift)) mask));
> 9007: format %{ "vplshift_imm_masked $dst, $dst, $shift\t! lshift masked operation" %}
No mask register shown in format.
src/hotspot/cpu/x86/x86.ad line 9053:
> 9051: match(Set dst (RShiftVI (Binary dst (RShiftCntV shift)) mask));
> 9052: match(Set dst (RShiftVL (Binary dst (RShiftCntV shift)) mask));
> 9053: format %{ "vprshift_imm_masked $dst, $dst, $shift\t! rshift masked operation" %}
No mask register shown in format.
src/hotspot/cpu/x86/x86.ad line 9099:
> 9097: match(Set dst (URShiftVI (Binary dst (RShiftCntV shift)) mask));
> 9098: match(Set dst (URShiftVL (Binary dst (RShiftCntV shift)) mask));
> 9099: format %{ "vpurshift_imm_masked $dst, $dst, $shift\t! urshift masked operation" %}
No mask register shown in format.
src/hotspot/cpu/x86/x86.ad line 9306:
> 9304: match(Set dst (MaskAll cnt));
> 9305: effect(TEMP_DEF dst, TEMP tmp);
> 9306: format %{ "mask_all_evexI $dst, $cnt \t! mask all operation" %}
No need to say TEMP_DEF for dst. Also good to show tmp register in format.
src/hotspot/cpu/x86/x86.ad line 9326:
> 9324: match(Set dst (MaskAll src));
> 9325: effect(TEMP_DEF dst, TEMP tmp);
> 9326: format %{ "mask_all_evexI $dst, $src \t! mask all operation" %}
No need to say TEMP_DEF for dst. Also good to show tmp register in format.
src/hotspot/cpu/x86/x86.ad line 9360:
> 9358: %}
> 9359:
> 9360: instruct mask_not_immLT8(kReg dst, kReg src, rRegI rtmp, kReg ktmp, immI_M1 cnt) %{
What happens to the not operation if vector_length < 8 and it is a !avx512dq platform?
src/hotspot/cpu/x86/x86.ad line 9364:
> 9362: match(Set dst (XorVMask src (MaskAll cnt)));
> 9363: effect(TEMP_DEF dst, TEMP rtmp, TEMP ktmp);
> 9364: format %{ "mask_not_LT8 $dst, $src, $cnt \t! mask not operation" %}
Good to show temp registers in format, helps in debugging.
src/hotspot/cpu/x86/x86.ad line 9375:
> 9373: predicate((Matcher::vector_length(n) == 8 && VM_Version::supports_avx512dq()) ||
> 9374: (Matcher::vector_length(n) == 16) ||
> 9375: (Matcher::vector_length(n) > 16 && VM_Version::supports_avx512bw()));
What happens to the not operation if vector_length == 8 and it is a !avx512dq platform?
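To illustrate the gap, here is a standalone copy of the quoted predicate with the feature checks replaced by plain booleans. The function name and parameters are stand-ins for Matcher::vector_length(n), VM_Version::supports_avx512dq() and VM_Version::supports_avx512bw(), so this is a sketch rather than the actual matcher logic.

```cpp
// Hypothetical stand-in for the predicate at x86.ad lines 9373-9375.
constexpr bool not_rule_matches(unsigned vlen, bool has_dq, bool has_bw) {
  return (vlen == 8 && has_dq) ||
         (vlen == 16) ||
         (vlen > 16 && has_bw);
}

// The case in question: 8 lanes on a platform without AVX512DQ
// is not accepted by this predicate, regardless of BW support.
static_assert(!not_rule_matches(8, false, true), "vlen 8 without DQ is unmatched");
```

Unless another rule covers it, the 8-lane case without DQ would have no matching instruct for the not operation.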
src/hotspot/cpu/x86/x86.ad line 9397:
> 9395: assert(0 == Type::cmp(mask1->bottom_type(), mask2->bottom_type()), "");
> 9396: uint masklen = Matcher::vector_length(this);
> 9397: masklen = masklen < 16 && !VM_Version::supports_avx512dq() ? 16 : masklen;
Good to add parentheses here:
(masklen < 16 && !VM_Version::supports_avx512dq())
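This is a readability suggestion only. Since `&&` binds tighter than `?:` in C++, both spellings compute the same value, as the small sketch below checks. The function names are made up for illustration and has_dq stands in for VM_Version::supports_avx512dq().

```cpp
// As written in the patch: relies on && binding tighter than ?:.
constexpr unsigned widen_unparenthesized(unsigned masklen, bool has_dq) {
  return masklen < 16 && !has_dq ? 16 : masklen;
}

// Suggested spelling: the condition is grouped explicitly.
constexpr unsigned widen_parenthesized(unsigned masklen, bool has_dq) {
  return (masklen < 16 && !has_dq) ? 16 : masklen;
}

// Masks shorter than 16 bits are widened to 16 when DQ is absent.
static_assert(widen_unparenthesized(8, false) == widen_parenthesized(8, false),
              "both forms agree");
```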
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/122