[vectorIntrinsics+mask] RFR: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets. [v4]
Jatin Bhateja
jbhateja at openjdk.java.net
Sat Aug 14 21:27:23 UTC 2021
On Fri, 13 Aug 2021 22:39:44 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> 8270349: Review comments resolution.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 8866:
>
>> 8864: }
>> 8865:
>> 8866: void Assembler::evpfma213ps(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
>
> You need fma231 here which is dst=dst + nds * src.
This conforms to the generated IR shape: dst is the first source (i.e. src1), followed by src2 and src3, and
FMA213 computes src1*src2 + src3.
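For reference, a minimal scalar model of the two operand orders being discussed (dst as the first operand, nds as the second, src as the third); the helper names below are illustrative only, not anything in the patch:

    // VFMADD213: dst = nds*dst + src, i.e. the IR shape src1*src2 + src3
    float fma213(float dst, float nds, float src) { return nds * dst + src; }
    // VFMADD231: dst = nds*src + dst, i.e. the dst + nds*src form from the review comment
    float fma231(float dst, float nds, float src) { return nds * src + dst; }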
> src/hotspot/cpu/x86/assembler_x86.cpp line 8881:
>
>> 8879: }
>> 8880:
>> 8881: void Assembler::evpfma213pd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
>
> You need fma231 here which is dst=dst + nds * src.
same as above
> src/hotspot/cpu/x86/assembler_x86.cpp line 8893:
>
>> 8891: }
>> 8892:
>> 8893: void Assembler::evpfma213pd(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
>
> You need fma231 here which is dst=dst + nds * src.
same as above
> src/hotspot/cpu/x86/x86.ad line 7752:
>
>> 7750: static_cast<const VectorTestNode*>(n->in(1))->get_predicate() == BoolTest::ne &&
>> 7751: vector_length(n->in(1)->in(1)) >= 8);
>> 7752: match(Set cr (CmpI (VectorTest src1 src2) zero));
>
> Similar pattern can be added for BoolTest::overflow (alltrue) case:
> knot tmp, src
> kortest tmp, tmp
Yes, this is missing in the mainline code too. I have taken a note of it; it will be added in a subsequent patch.
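A scalar model of the alltrue check sketched in the review comment (a hedged sketch only, assuming the mask is held in a 16-bit value and vlen is the active lane count):

    #include <cstdint>
    // knot tmp, src    -> invert the mask bits of the active lanes
    // kortest tmp, tmp -> sets ZF when tmp is all zero, i.e. src was all ones
    bool alltrue(uint16_t src, int vlen) {
      uint16_t tmp = (uint16_t)(~src) & (uint16_t)((1u << vlen) - 1);
      return tmp == 0;
    }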
> src/hotspot/cpu/x86/x86.ad line 7931:
>
>> 7929:
>> 7930: instruct vstoreMask8B_evex(vec dst, vec src, immI_8 size) %{
>> 7931: predicate(UseAVX > 2 && NULL == n->in(1)->bottom_type()->isa_vectmask());
>
> A nit-pick traditional usage is (n->in(1)->bottom_type()->isa_vectmask() == NULL).
DONE. In general I follow the CONST == VAR convention to avoid an accidental assignment (VAR = CONST) inside a comparison condition; with the constant on the left, a mistyped assignment does not compile because the constant is not an lvalue.
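A minimal illustration of the convention (hypothetical pointer, purely for illustration):

    #include <cstddef>
    int* p = nullptr;
    if (p = NULL)  { /* typo: compiles and silently assigns */ }
    if (NULL == p) { /* the same typo (NULL = p) is a compile error */ }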
> src/hotspot/cpu/x86/x86.ad line 7955:
>
>> 7953: __ vpxor($dst$$XMMRegister, $dst$$XMMRegister, $dst$$XMMRegister, dst_vlen_enc);
>> 7954: __ evmovdqub($dst$$XMMRegister, $mask$$KRegister, ExternalAddress(vector_masked_cmp_bits()),
>> 7955: true, dst_vlen_enc, $scratch$$Register);
>
> If you do merge-masking as false, then dst need not be cleared using vxor and the rule can be simplified.
> You could alternatively also use the vmovm2b followed by vpabsb. Thereby eliminating the need for vector_masked_cmp_bits.
Yes, the proposed sequence has more instructions, but it also avoids any penalty due to a cache miss on the memory-resident bit pattern.

Existing:
    VCMP K1, MASK_VEC, MEM_BIT_PATTERN

Proposed:
    VXOR V1, V1
    VSUB V1, MASK_VEC
    VMOVM2B K1, V1

I did quick perf runs and did see the increase in instruction count.
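A per-byte-lane model of the in-register alternative named in the review (vpmovm2b followed by vpabsb), just to show why no memory constant is needed; the helper name is hypothetical:

    #include <cstdint>
    // one byte lane of the stored boolean mask
    int8_t store_mask_lane(bool kbit) {
      int8_t lane = kbit ? -1 : 0;               // vpmovm2b: mask bit -> 0x00 / 0xFF
      return (int8_t)(lane < 0 ? -lane : lane);  // vpabsb:   0xFF -> 0x01, 0x00 stays 0x00
    }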
> src/hotspot/cpu/x86/x86.ad line 8996:
>
>> 8994: int opc = this->ideal_Opcode();
>> 8995: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
>> 8996: $dst$$XMMRegister, $src2$$XMMRegister, false, vlen_enc);
>
> Since merge masking is false here, dst and src1 could be separate registers.
> There is another flavor of rearrange with second vector, e.g.:
> IntVector rearrange(VectorShuffle<Integer> s, Vector<Integer> v);
> Which can use rearrange with merge masking true.
> I don't see a rule for that. Do you plan to add that later?
All these instructions are in two-address format, so src1 is copied into dst first. The second flavor of rearrange does not accept a mask argument; this pattern specifically handles the masked case.
> src/hotspot/cpu/x86/x86.ad line 9010:
>
>> 9008: int opc = this->ideal_Opcode();
>> 9009: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
>> 9010: $dst$$XMMRegister, $src2$$Address, false, vlen_enc);
>
> Since merge masking is false here, dst and src1 could be separate registers.
same as above
> src/hotspot/cpu/x86/x86.ad line 9021:
>
>> 9019: match(Set dst (AbsVL dst mask));
>> 9020: format %{ "vabs_masked $dst, $mask \t! vabs masked operation" %}
>> 9021: ins_cost(100);
>
> It is not clear why ins_cost is required for matching?
To give this pattern preference over the (Set dst (AbsVI src)) pattern during matching.
> src/hotspot/cpu/x86/x86.ad line 9057:
>
>> 9055: int opc = this->ideal_Opcode();
>> 9056: __ evmasked_op(opc, bt, $mask$$KRegister, $dst$$XMMRegister,
>> 9057: $src2$$XMMRegister, $src3$$Address, true, vlen_enc);
>
> This and the previous instruct should translate to fma231.
same as above
> src/hotspot/cpu/x86/x86.ad line 9063:
>
>> 9061:
>> 9062: instruct evcmp_masked(kReg dst, vec src1, vec src2, immI8 cond, kReg mask, rRegP scratch) %{
>> 9063: predicate(UseAVX > 2);
>
> Is this check enough? How about vl?
That is already checked in match_rule_supported_vector_masked. In fact, UseAVX > 2 is also redundant in all the newly added patterns.
> src/hotspot/cpu/x86/x86.ad line 9121:
>
>> 9119: int vlen_enc = vector_length_encoding(vector_length(this));
>> 9120: __ evpbroadcastb($xtmp$$XMMRegister, $src$$Register, vlen_enc);
>> 9121: __ evpmovb2m($dst$$KRegister, $xtmp$$XMMRegister, vlen_enc);
>
> Since either all bits are set or clear in $src, we could just do a kmov from $src into $dst with appropriate width.
same as above
> src/hotspot/cpu/x86/x86.ad line 9134:
>
>> 9132: int vlen_enc = vector_length_encoding(vector_length(this));
>> 9133: __ evpbroadcastb($xtmp$$XMMRegister, $src$$Register, vlen_enc);
>> 9134: __ evpmovb2m($dst$$KRegister, $xtmp$$XMMRegister, vlen_enc);
>
> Since either all bits are set or clear in $src, we could just do a kmov from $src into $dst with appropriate width.
This is a maskAll operation: the mask (src) is initially only 0/-1, which needs to be broadcast into a vector; evpmovb2m then picks the MSB of each byte lane and moves it into the opmask register (dst).
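A scalar model of that maskAll sequence (a hedged sketch only; the helper name is illustrative), with src being 0 or -1:

    #include <cstdint>
    uint64_t mask_all(int64_t src, int vlen) {
      uint64_t k = 0;
      for (int i = 0; i < vlen; i++) {
        // evpbroadcastb replicates the low byte of src into every byte lane;
        // evpmovb2m then gathers the MSB of each lane into the opmask register
        k |= (uint64_t)(((uint8_t)src >> 7) & 1u) << i;
      }
      return k;
    }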
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/99