[vectorIntrinsics+mask] RFR: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets. [v4]
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Sat Aug 14 01:18:50 UTC 2021
On Fri, 13 Aug 2021 18:24:01 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Intel targets supporting the AVX512 feature offer predicated vector instructions: vector operations performed only on the vector lanes selected by an opmask register. On non-AVX512 targets, masked vector operations are instead supported by an explicit vector blend after the main vector operation, which performs the needed lane selection.
>>
>> This patch adds initial X86 backend support for predicated vector operations.
>>
>> Following is performance data for existing VectorAPI JMH benchmarks with the patch:
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake server 40C 2S)
>>
>> Benchmark | SIZE | Baseline (ops/ms) | WithOpts (ops/ms) | Gain
>> -- | -- | -- | -- | --
>> Int512Vector.ABSMasked | 1024 | 10132.664 | 10394.942 | 1.025884407
>> Int512Vector.ADDMasked | 1024 | 7785.805 | 8980.133 | 1.153398139
>> Int512Vector.ADDMaskedLanes | 1024 | 5809.455 | 6350.628 | 1.093153833
>> Int512Vector.ANDMasked | 1024 | 7776.729 | 8965.988 | 1.152925349
>> Int512Vector.ANDMaskedLanes | 1024 | 6717.202 | 7426.217 | 1.105552133
>> Int512Vector.AND_NOTMasked | 1024 | 7688.835 | 8988.659 | 1.169053439
>> Int512Vector.ASHRMasked | 1024 | 6808.185 | 7883.755 | 1.1579819
>> Int512Vector.ASHRMaskedShift | 1024 | 9523.164 | 12166.72 | 1.277592195
>> Int512Vector.BITWISE_BLENDMasked | 1024 | 5919.647 | 6864.988 | 1.159695502
>> Int512Vector.DIVMasked | 1024 | 237.174 | 236.014 | 0.995109076
>> Int512Vector.FIRST_NONZEROMasked | 1024 | 5387.315 | 7890.42 | 1.464629412
>> Int512Vector.LSHLMasked | 1024 | 6806.898 | 7881.315 | 1.157842383
>> Int512Vector.LSHLMaskedShift | 1024 | 9552.257 | 12153.769 | 1.272345269
>> Int512Vector.LSHRMasked | 1024 | 6776.605 | 7897.786 | 1.165448776
>> Int512Vector.LSHRMaskedShift | 1024 | 9500.087 | 12134.962 | 1.277352723
>> Int512Vector.MAXMaskedLanes | 1024 | 6993.149 | 7580.399 | 1.083975045
>> Int512Vector.MINMaskedLanes | 1024 | 6925.363 | 7450.814 | 1.075873424
>> Int512Vector.MULMasked | 1024 | 7732.753 | 8956.02 | 1.158192949
>> Int512Vector.MULMaskedLanes | 1024 | 4066.384 | 4152.375 | 1.021146798
>> Int512Vector.NEGMasked | 1024 | 8760.797 | 9255.063 | 1.056417926
>> Int512Vector.NOTMasked | 1024 | 8981.123 | 9229.573 | 1.027663578
>> Int512Vector.ORMasked | 1024 | 7786.787 | 8967.057 | 1.151573428
>> Int512Vector.ORMaskedLanes | 1024 | 6694.36 | 7450.106 | 1.112892943
>> Int512Vector.SUBMasked | 1024 | 7782.939 | 9001.692 | 1.156592901
>> Int512Vector.XORMasked | 1024 | 7785.031 | 9070.342 | 1.165100306
>> Int512Vector.XORMaskedLanes | 1024 | 6700.689 | 7454.73 | 1.112531861
>> Int512Vector.ZOMOMasked | 1024 | 6982.297 | 8313.51 | 1.190655453
>> Int512Vector.gatherMasked | 1024 | 361.497 | 1494.876 | 4.135237637
>> Int512Vector.scatterMasked | 1024 | 490.05 | 3120.425 | 6.367564534
>> Int512Vector.sliceMasked | 1024 | 1436.248 | 1597.805 | 1.112485448
>> Int512Vector.unsliceMasked | 1024 | 296.721 | 346.434 | 1.167541226
>> Float512Vector.ADDMasked | 1024 | 7645.873 | 9123.386 | 1.193243205
>> Float512Vector.ADDMaskedLanes | 1024 | 2404.371 | 2529.284 | 1.051952465
>> Float512Vector.DIVMasked | 1024 | 5134.602 | 5129.085 | 0.998925525
>> Float512Vector.FIRST_NONZEROMasked | 1024 | 5040.567 | 7078.828 | 1.404371373
>> Float512Vector.FMAMasked | 1024 | 5996.419 | 6902.626 | 1.151124696
>> Float512Vector.MAXMaskedLanes | 1024 | 1681.249 | 1727.444 | 1.027476596
>> Float512Vector.MINMaskedLanes | 1024 | 1610.115 | 1667.143 | 1.035418588
>> Float512Vector.MULMasked | 1024 | 7812.317 | 9054.137 | 1.158956683
>> Float512Vector.MULMaskedLanes | 1024 | 2406.81 | 2514.018 | 1.044543608
>> Float512Vector.NEGMasked | 1024 | 8248.933 | 9834.607 | 1.192227771
>> Float512Vector.SQRTMasked | 1024 | 4278.046 | 4281.009 | 1.000692606
>> Float512Vector.SUBMasked | 1024 | 7697.582 | 9044.305 | 1.174954031
>> Float512Vector.gatherMasked | 1024 | 428.428 | 1491.441 | 3.48119404
>> Float512Vector.scatterMasked | 1024 | 416.169 | 3216.628 | 7.729138883
>> Float512Vector.sliceMasked | 1024 | 1431.07 | 1609.12 | 1.124417394
>> Float512Vector.unsliceMasked | 1024 | 292.513 | 331.366 | 1.132824866
>>
>>
>>
>> PS: The above data shows the performance gains for two vector species, Int512 and Float512. In general, across all species we see 1.2x-2x gains on the various masking operations supported so far.
>> The new matcher routine `Matcher::match_rule_supported_vector_masked` lists the masking operations supported by this patch.
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8270349: Review comments resolution.
src/hotspot/cpu/x86/assembler_x86.cpp line 8866:
> 8864: }
> 8865:
> 8866: void Assembler::evpfma213ps(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
You need fma231 here, which is dst = dst + nds * src.
src/hotspot/cpu/x86/assembler_x86.cpp line 8881:
> 8879: }
> 8880:
> 8881: void Assembler::evpfma213pd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
You need fma231 here, which is dst = dst + nds * src.
src/hotspot/cpu/x86/assembler_x86.cpp line 8893:
> 8891: }
> 8892:
> 8893: void Assembler::evpfma213pd(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
You need fma231 here, which is dst = dst + nds * src.
src/hotspot/cpu/x86/x86.ad line 3764:
> 3762: match(Set dst (LoadVectorGather mem idx));
> 3763: effect(TEMP dst, TEMP tmp, TEMP ktmp);
> 3764: format %{ "load_vector_gather $dst, $mem, $idx\t! using $tmp and k2 as TEMP" %}
There is no k2 here; it should be ktmp.
src/hotspot/cpu/x86/x86.ad line 3784:
> 3782: match(Set dst (LoadVectorGatherMasked mem (Binary idx mask)));
> 3783: effect(TEMP_DEF dst, TEMP tmp, TEMP ktmp);
> 3784: format %{ "load_vector_gather_masked $dst, $mem, $idx, $mask\t! using $tmp and k2 as TEMP" %}
There is no k2 here; it should be ktmp.
src/hotspot/cpu/x86/x86.ad line 7085:
> 7083: %}
> 7084:
> 7085: instruct evcmpFD(kReg dst, vec src1, vec src2, immI8 cond, rRegP scratch) %{
scratch register is not used.
src/hotspot/cpu/x86/x86.ad line 7196:
> 7194:
> 7195:
> 7196: instruct evcmp(kReg dst, vec src1, vec src2, immI8 cond, rRegP scratch) %{
scratch register is not used.
src/hotspot/cpu/x86/x86.ad line 7743:
> 7741: __ kshiftrbl($ktmp$$KRegister, $ktmp$$KRegister, 8-masklen);
> 7742: __ kandbl($ktmp$$KRegister, $ktmp$$KRegister, $src1$$KRegister);
> 7743: __ ktestbl($ktmp$$KRegister, $ktmp$$KRegister);
Could be replaced by a single instruction, along the same lines as anyTrue:
__ ktestbl($ktmp$$KRegister, $src1$$KRegister);
src/hotspot/cpu/x86/x86.ad line 7752:
> 7750: static_cast<const VectorTestNode*>(n->in(1))->get_predicate() == BoolTest::ne &&
> 7751: vector_length(n->in(1)->in(1)) >= 8);
> 7752: match(Set cr (CmpI (VectorTest src1 src2) zero));
A similar pattern can be added for the BoolTest::overflow (alltrue) case:
knot tmp, src
kortest tmp, tmp
src/hotspot/cpu/x86/x86.ad line 7790:
> 7788: int vlen_enc = vector_length_encoding(vlen_in_bytes);
> 7789: __ evpcmp(T_BYTE, $dst$$KRegister, k0, $src$$XMMRegister, ExternalAddress(vector_masked_cmp_bits()),
> 7790: Assembler::eq, vlen_enc, $scratch$$Register);
We could use evpmovb2m here.
src/hotspot/cpu/x86/x86.ad line 7902:
> 7900:
> 7901: instruct vstoreMask2B_evex(vec dst, vec src, immI_2 size) %{
> 7902: predicate(VM_Version::supports_avx512bw());
Do we need the check (n->in(1)->bottom_type()->isa_vectmask() == NULL) here and in vstoreMask4B_evex?
If not, why is this check there in vstoreMask8B_evex?
src/hotspot/cpu/x86/x86.ad line 7931:
> 7929:
> 7930: instruct vstoreMask8B_evex(vec dst, vec src, immI_8 size) %{
> 7931: predicate(UseAVX > 2 && NULL == n->in(1)->bottom_type()->isa_vectmask());
A nit-pick: the traditional usage is (n->in(1)->bottom_type()->isa_vectmask() == NULL).
src/hotspot/cpu/x86/x86.ad line 7955:
> 7953: __ vpxor($dst$$XMMRegister, $dst$$XMMRegister, $dst$$XMMRegister, dst_vlen_enc);
> 7954: __ evmovdqub($dst$$XMMRegister, $mask$$KRegister, ExternalAddress(vector_masked_cmp_bits()),
> 7955: true, dst_vlen_enc, $scratch$$Register);
If you set merge-masking to false, then dst need not be cleared with vpxor and the rule can be simplified.
Alternatively, you could use vpmovm2b followed by vpabsb, thereby eliminating the need for vector_masked_cmp_bits.
src/hotspot/cpu/x86/x86.ad line 8996:
> 8994: int opc = this->ideal_Opcode();
> 8995: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
> 8996: $dst$$XMMRegister, $src2$$XMMRegister, false, vlen_enc);
Since merge masking is false here, dst and src1 could be separate registers.
There is another flavor of rearrange with a second vector, e.g.:
IntVector rearrange(VectorShuffle<Integer> s, Vector<Integer> v);
which can use rearrange with merge masking set to true.
I don't see a rule for that. Do you plan to add it later?
src/hotspot/cpu/x86/x86.ad line 9001:
> 8999: %}
> 9000:
> 9001: instruct vrearrangev_mem_masked(vec dst, memory src2, kReg mask) %{
How is VectorRearrange getting the shuffle through memory directly? There is always a VectorLoadShuffle, isn't there?
src/hotspot/cpu/x86/x86.ad line 9010:
> 9008: int opc = this->ideal_Opcode();
> 9009: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
> 9010: $dst$$XMMRegister, $src2$$Address, false, vlen_enc);
Since merge masking is false here, dst and src1 could be separate registers.
src/hotspot/cpu/x86/x86.ad line 9021:
> 9019: match(Set dst (AbsVL dst mask));
> 9020: format %{ "vabs_masked $dst, $mask \t! vabs masked operation" %}
> 9021: ins_cost(100);
It is not clear why ins_cost is required for matching.
src/hotspot/cpu/x86/x86.ad line 9057:
> 9055: int opc = this->ideal_Opcode();
> 9056: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
> 9057: $src2$$XMMRegister, $src3$$Address, true, vlen_enc);
This and the previous instruct should translate to fma231.
src/hotspot/cpu/x86/x86.ad line 9063:
> 9061:
> 9062: instruct evcmp_masked(kReg dst, vec src1, vec src2, immI8 cond, kReg mask, rRegP scratch) %{
> 9063: predicate(UseAVX > 2);
Is this check enough? How about vl?
src/hotspot/cpu/x86/x86.ad line 9108:
> 9106: break;
> 9107: }
> 9108: default: assert(false, "%s", type2name(src1_elem_bt));
Missing break; here.
src/hotspot/cpu/x86/x86.ad line 9121:
> 9119: int vlen_enc = vector_length_encoding(vector_length(this));
> 9120: __ evpbroadcastb($xtmp$$XMMRegister, $src$$Register, vlen_enc);
> 9121: __ evpmovb2m($dst$$KRegister, $xtmp$$XMMRegister, vlen_enc);
Since all bits in $src are either set or clear, we could just do a kmov from $src into $dst with the appropriate width.
src/hotspot/cpu/x86/x86.ad line 9134:
> 9132: int vlen_enc = vector_length_encoding(vector_length(this));
> 9133: __ evpbroadcastb($xtmp$$XMMRegister, $src$$Register, vlen_enc);
> 9134: __ evpmovb2m($dst$$KRegister, $xtmp$$XMMRegister, vlen_enc);
Since all bits in $src are either set or clear, we could just do a kmov from $src into $dst with the appropriate width.
src/hotspot/cpu/x86/x86.ad line 9145:
> 9143: match(Set dst (OrVMask src1 src2));
> 9144: match(Set dst (XorVMask src1 src2));
> 9145: effect(TEMP kscratch);
kscratch is not being used in ins_encode.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/99