[vectorIntrinsics] RFR: 8274569: X86 backend related incorrectness issues in legacy store mask patterns
Jatin Bhateja
jbhateja at openjdk.java.net
Mon Oct 4 06:37:55 UTC 2021
On Thu, 30 Sep 2021 19:34:43 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
> - The issue was seen while unit testing changes for masking-related optimizations on the vectorIntrinsics branch.
>
> - The following patterns result in an incorrect store mask computation. Contextually, the store mask is followed by a StoreVector, which does take care of the various vector sizes, but C2 still emits a VectorStoreMask in the IR to populate a byte vector:
>
> - instruct storeMask2B
> - instruct storeMask4B
> - instruct storeMask8B
> - instruct storeMask8B_avx
>
> - Replicate with an immI operand correctly takes care of shorter vector lengths, so it can serve as the fallback case for the following pattern when the immediate argument is -1:
> - instruct ReplI_M1
>
> The patch also adds a test case which exhaustively covers load/store vector mask operations for the different SPECIES.
> Do we still need `andq`?
>
> ```c++
> void C2_MacroAssembler::vector_mask_operation(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,
>                                               XMMRegister xtmp1, Register tmp, int masklen, int vec_enc) {
>   assert(VM_Version::supports_avx(), "");
>   vpxor(xtmp, xtmp, xtmp, vec_enc);
>   vpsubb(xtmp, xtmp, mask, vec_enc);
>   vpmovmskb(tmp, xtmp, vec_enc);
>   if (masklen < 64) {
>     andq(tmp, (((jlong)1 << masklen) - 1)); // <---- Do we still need this?
>   }
>   switch(opc) {
>     case Op_VectorMaskTrueCount:
>       popcntq(dst, tmp);
>       break;
>     case Op_VectorMaskLastTrue:
>       mov64(dst, -1);
>       bsrq(tmp, tmp);
>       cmov(Assembler::notZero, dst, tmp);
>       break;
>     case Op_VectorMaskFirstTrue:
>       mov64(dst, masklen);
>       bsfq(tmp, tmp);
>       cmov(Assembler::notZero, dst, tmp);
>       break;
>     default: assert(false, "Unhandled mask operation");
>   }
> }
> ```
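As background for the storeMask patterns listed above: VectorStoreMask narrows each mask lane (all-ones for a set lane, all-zero for a clear lane) down to a single byte holding 1 or 0. A rough scalar model of the 4-byte-lane case, assuming the numeric suffix in the pattern names denotes the lane size in bytes (the function and variable names below are made up for illustration; this is not the matcher code):

```c++
#include <cstdint>
#include <cstdio>

// Illustrative scalar model of VectorStoreMask for 4-byte (int) lanes:
// each all-ones/all-zero mask lane is narrowed to one byte of 1 or 0.
static void store_mask_4B(const int32_t* mask_lanes, uint8_t* out, int nlanes) {
  for (int i = 0; i < nlanes; i++) {
    out[i] = (mask_lanes[i] != 0) ? 1 : 0;
  }
}

int main() {
  int32_t lanes[4] = {-1, 0, -1, 0};   // lanes 0 and 2 set
  uint8_t bytes[4];
  store_mask_4B(lanes, bytes, 4);
  printf("%d %d %d %d\n", bytes[0], bytes[1], bytes[2], bytes[3]);  // 1 0 1 0
  return 0;
}
```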
There are four ways in which masks enter the system:
1) Direct mask loading (VectorLoadMask): This is generally preceded by a LoadVector, which always loads exactly the required amount of memory, so this case is guaranteed to populate the mask without overshooting.
2) Vector comparison (VectorMaskCmp): This case may set more bits in the mask than are actually covered by the comparison if we compare partial vectors (less than 16 bytes); the scalar sketch after this list illustrates the effect.
3) MaskAll operation: This case is again based on the Replicate instruction, and the existing patterns for Repl[BSI]_reg do not ensure exact broadcasting in partial-vector scenarios; the newer AVX512 patterns do generate correct masks whose length matches the species length.
4) Mask manipulation operations like And/Or/Xor.
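To make cases 2 and 3 concrete, here is a rough scalar model of the negate / `vpmovmskb` / `andq` sequence from the snippet above (illustrative only; the lane values and masklen are made-up examples, not anything the VM produces):

```c++
#include <cstdint>
#include <cstdio>

// Scalar stand-in for the vector sequence in vector_mask_operation():
// each byte of 'lanes' models one lane of the boolean byte vector that
// VectorStoreMask produces (0 = false, 1 = true). Bytes at index >= masklen
// model stale register contents left behind by a partial-vector compare
// or an over-wide broadcast (cases 2 and 3 above).
static uint64_t extract_mask_bits(const uint8_t* lanes, int nbytes) {
  uint64_t bits = 0;
  for (int i = 0; i < nbytes; i++) {
    uint8_t negated = (uint8_t)(0 - lanes[i]);   // vpxor + vpsubb: 1 -> 0xFF, 0 -> 0x00
    bits |= (uint64_t)(negated >> 7) << i;       // vpmovmskb: collect each byte's MSB
  }
  return bits;
}

int main() {
  // 16-byte register, but the species only has 8 lanes (masklen = 8).
  // Lanes 0..7 are real; bytes 8..15 hold stale nonzero values.
  uint8_t lanes[16] = {1,0,1,1,0,0,0,1,  1,1,1,1,1,1,1,1};
  const int masklen = 8;

  uint64_t raw     = extract_mask_bits(lanes, 16);
  uint64_t clamped = raw & (((uint64_t)1 << masklen) - 1);  // the andq(tmp, ...) step

  printf("trueCount without clamp: %d\n", __builtin_popcountll(raw));     // 12 (wrong)
  printf("trueCount with clamp:    %d\n", __builtin_popcountll(clamped)); // 4  (correct)
  return 0;
}
```

Without the clamp, the stale bytes above masklen flow straight into popcntq/bsrq/bsfq, which suggests the `andq` is still needed for these entry paths.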
I did fix the store mask patterns in this patch, since it wasn't costly, but fixing all the patterns looks unreasonable and suboptimal, and is beyond the scope of this patch.
I am adding the fix you proposed for toLong to the patch, as it is foolproof; similar handling is needed to strengthen the EVEX pattern as well.
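The effect of that toLong fix can be modelled in scalar form as follows (a minimal sketch of the intended semantics; the helper name is made up and this is not the actual patch). The same guard is what the EVEX pattern needs on its path:

```c++
#include <cstdint>
#include <cassert>

// Hypothetical helper (name made up) modelling what a mask-to-long
// conversion must return: exactly one bit per lane, zero above masklen.
// Without the clamp, stray movemask bits from a partial vector would
// leak into the returned long.
static uint64_t mask_to_long(uint64_t movemask_bits, int masklen) {
  if (masklen < 64) {
    movemask_bits &= (((uint64_t)1 << masklen) - 1);  // same guard as andq(tmp, ...)
  }
  return movemask_bits;
}

int main() {
  // 4-lane species: stale bits 4..15 set; the clamp keeps only lanes 0..3.
  assert(mask_to_long(0xFFF5, 4) == 0x5);
  return 0;
}
```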
Kindly review and let me know if this looks ok to you.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/139