[vectorIntrinsics+mask] RFR: 8273406: Optimize various masked vector operations for AVX512 target.
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Thu Sep 9 00:10:11 UTC 2021
On Tue, 7 Sep 2021 14:53:25 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
> This patch continues the X86 backend support for optimizing masked operations on AVX-512 targets (JDK-8262356).
>
> Summary of changes:
>
> 1) Support for masked rotate left and right operations over integer/long vectors.
>
> 2) Support for masked square root operation over float/double vectors.
>
> 3) Support for masked logical shift-left and logical/arithmetic shift-right operations with a constant shift count.
>
> 4) Optimized VectorMask.not operation by emitting direct KNOT instruction.
>
> 5) Extended masking optimization support for the X86 KNL target, which has a limited set of AVX-512 features.
>
> - Currently the vector type associated with a VectorLoadMask operation is created during the parsing stage.
> For targets supporting opmask registers, the lane type is explicitly set to BOOLEAN irrespective of the primitive
> type of the species, i.e. for the Int512 species the ideal type TypeVectMask(16,BOOL) represents a vector of 16 BOOLEAN elements,
> each of which represents a mask bit for the corresponding vector lane.
> This type information is also associated with the respective mask boxes (Int512Mask).
>
> - During macro expansion vbox/vunbox nodes are broken down into granular target mappable ideal nodes.
>
> ```
> VectorBoxNode -> VectorStoreMask + StoreVector
>
> VectorUnboxNode -> LoadVector + VectorLoadMask
> ```
>
> At this stage the vector type (TypeVectMask(16,BOOL)) earlier associated with the vunbox node is used to create the
> type for the VectorLoadMask operation.
>
> - Masks can be propagated either through a vector (non-AVX512 targets) or using opmask registers (K1-K7).
> The decision to create the correct ideal type based on the target features is delegated to the low-level
> type creation routine TypeVect::makemask.
>
> - This creates a problem for targets like KNL, which support a limited set of AVX-512 features, i.e. do
> not support the AVX512VL and AVX512BW features.
>
> - For the Int512 species the initial ideal type constructed during parsing is based on the primitive type and
> lane count associated with the species, but during macro expansion the type creation
> decision is based on the vector type associated with the v[u]box nodes, i.e. TypeVectMask(16,BOOL);
> thus for the KNL target an incorrect vector mask type TypeVectX(16,BOOL) gets created, since it does not
> support vector length extension (128/256-bit operations over EVEX-encoded instructions).
>
> - There are multiple ways to fix this discrepancy; the cleanest approach is to create the ideal type TypeVectMask
> based on the primitive lane type of the species, instead of always setting the lane type to BOOLEAN.
> This will also preserve the original lane type information, which was needed in some cases, e.g.
> a reinterpretation operation over a mask. To circumvent such issues, explicit src/dst primitive types
> were added to ideal nodes.
>
> - Also, this does not disturb the register mask and spilling behavior associated with opmask registers,
> thus the change is transparent to backend passes.
>
> Validation:
> Patch regression tested through tier1-3 tests at AVX Level=0,1,2,3 and UseKNLSetting.
src/hotspot/cpu/x86/assembler_x86.cpp line 8704:
> 8702:
> 8703: void Assembler::evsqrtps(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 8704: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
The check for UseAVX > 2 is missing here, and in other places, for the vector_len == AVX_512bit case.
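One possible reading of the stricter precondition, as a standalone sketch. `CpuFeatures`, `masked_evex_encoding_ok` and the enum below are hypothetical stand-ins for `UseAVX`, `VM_Version::supports_avx512vl()` and the assembler's vector-length encodings; this is not the HotSpot code and not necessarily the exact form the assert should take.

```cpp
#include <cassert>

// Hypothetical stand-ins for the assembler's vector-length encodings.
enum VectorLen { AVX_128bit = 0, AVX_256bit = 1, AVX_512bit = 2 };

// Hypothetical stand-in for the relevant VM state:
// use_avx mirrors the UseAVX level, avx512vl mirrors supports_avx512vl().
struct CpuFeatures {
  int  use_avx;
  bool avx512vl;
};

// A masked EVEX instruction needs UseAVX > 2 in every case.
// On top of that, 128/256-bit vectors need AVX512VL; 512-bit vectors do not.
inline bool masked_evex_encoding_ok(const CpuFeatures& cpu, int vector_len) {
  if (cpu.use_avx <= 2) {
    return false;  // no EVEX at all, even for the 512-bit form
  }
  return vector_len == AVX_512bit || cpu.avx512vl;
}
```

With this predicate, a 512-bit request at an AVX level of 2 is rejected, whereas the current assert would let it through.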
src/hotspot/cpu/x86/assembler_x86.cpp line 9086:
> 9084: }
> 9085:
> 9086: void Assembler::evpsllw(XMMRegister dst, KRegister mask, int shift, XMMRegister src, bool merge, int vector_len) {
For all the shift and rotate instructions, the operand order could be dst, mask, src, shift, merge, vector_len.
src/hotspot/cpu/x86/assembler_x86.cpp line 9112:
> 9110: void Assembler::evpsllq(XMMRegister dst, KRegister mask, int shift, XMMRegister src, bool merge, int vector_len) {
> 9111: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9112: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
vex_w should be true here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9148:
> 9146: void Assembler::evpsrlq(XMMRegister dst, KRegister mask, int shift, XMMRegister src, bool merge, int vector_len) {
> 9147: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9148: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
vex_w should be true here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9184:
> 9182: void Assembler::evpsraq(XMMRegister dst, KRegister mask, int shift, XMMRegister src, bool merge, int vector_len) {
> 9183: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9184: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
vex_w should be true here.
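To illustrate why the element width matters for the three quadword shifts above: with the W bit clear the encoding describes 32-bit lanes, so bits no longer cross the dword boundary. A minimal model in plain C++ (the helper names are made up for illustration, this is not assembler code):

```cpp
#include <cassert>
#include <cstdint>

// Left shift of one 64-bit lane: the intended behaviour of a quadword shift.
// Counts of 64 or more produce zero.
inline uint64_t shl_as_one_qword(uint64_t lane, unsigned count) {
  return count < 64 ? lane << count : 0;
}

// The same 64 bits shifted as two independent 32-bit lanes: the behaviour
// of a doubleword shift. Counts of 32 or more zero each lane.
inline uint64_t shl_as_two_dwords(uint64_t lane, unsigned count) {
  uint32_t lo = static_cast<uint32_t>(lane);
  uint32_t hi = static_cast<uint32_t>(lane >> 32);
  lo = count < 32 ? lo << count : 0;
  hi = count < 32 ? hi << count : 0;
  return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

For the input 0x80000000 and a count of 1, the qword form yields 0x100000000 while the dword form loses the bit and yields 0.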
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3856:
> 3854: Assembler::evpsrad(dst, mask, imm8, src1, merge, vlen_enc); break;
> 3855: case Op_URShiftVL:
> 3856: Assembler::evpsraq(dst, mask, imm8, src1, merge, vlen_enc); break;
Op_URShiftVL is an unsigned shift, so it should use evpsrlq; only the signed shift should use evpsraq.
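The difference is visible on any negative lane. A scalar sketch of the two semantics (helper names are hypothetical; the count is masked to 0..63 the way Java does for long shifts):

```cpp
#include <cassert>
#include <cstdint>

// Logical (unsigned) right shift: vacated high bits are filled with zeros.
// This matches Java's `>>>`, i.e. the URShiftVL case.
inline int64_t logical_shift_right(int64_t value, unsigned count) {
  return static_cast<int64_t>(static_cast<uint64_t>(value) >> (count & 63));
}

// Arithmetic (signed) right shift: vacated high bits copy the sign bit.
// This matches Java's `>>`, i.e. the RShiftVL case.
inline int64_t arithmetic_shift_right(int64_t value, unsigned count) {
  count &= 63;
  uint64_t bits = static_cast<uint64_t>(value) >> count;
  if (value < 0 && count != 0) {
    bits |= ~uint64_t{0} << (64 - count);  // replicate the sign bit
  }
  return static_cast<int64_t>(bits);
}
```

Shifting -8 right by 1 gives -4 arithmetically but a large positive value logically, so emitting the arithmetic form for URShiftVL changes results for every negative input.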
src/hotspot/share/opto/vectornode.hpp line 1397:
> 1395: : VectorNode(in, dst_vt), _src_vt(src_vt) {
> 1396: assert((!dst_vt->isa_vectmask() && !src_vt->isa_vectmask()) ||
> 1397: (type2aelembytes(src_vt->element_basic_type()) >= type2aelembytes(dst_vt->element_basic_type())),
Why this additional limitation for vectmask only? Please add some comments here.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/122