RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2]

Tue Jul 5 18:50:51 UTC 2022

On Mon, 4 Jul 2022 16:39:29 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds.
>> 
>> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type.
>> 
>> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors.
>> 
>> Please find below the JMH micro stats with and without patch.
>> 
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]
>> 
>> Baseline:
>> Benchmark                                          (inSize)  (outSize)   Mode  Cnt    Score   Error   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2  712.218          ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  156.912          ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  255.814          ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  267.688          ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  140.957          ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2  474.009          ops/ms
>> 
>> 
>> With Opt:
>> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   742.781          ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  1241.021          ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  2333.311          ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  3258.754          ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192          ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590          ops/ms
>> 
>> 
>> Predicated memory operation over sub-word type will be handled in a subsequent patch. 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8289186: Review comments resolved.

src/hotspot/cpu/x86/x86.ad line 1767:

> 1765:       break;
> 1766:     case Op_StoreVectorMasked:
> 1767:       if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) {

Why duplicate the same check?
And you did not answer my suggestion about modifying the check.

src/hotspot/share/opto/vectorIntrinsics.cpp line 313:

> 311:     if (!is_supported && (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked)) {
> 312:       return true;
> 313:     }

Still unclear for me. As I understand `the upfront checks`, you mention, are checks at lines 270 and 286. So we know that `StoreVectorMasked` and `LoadVectorMasked` are supported.
The second part of your comment is talking about `non-predicated targets`. Does it mean that the real check should be next?:

   if (Matcher::has_predicated_vectors()) {
      ....
   } else if (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked) {
     // your comment
     return true;
   }
   ```
   Or I am still missing what are you trying to do here?

src/hotspot/share/opto/vectornode.hpp line 907:

> 905:   StoreVectorMaskedNode(Node* c, Node* mem, Node* dst, Node* src, const TypePtr* at, Node* mask)
> 906:    : StoreVectorNode(c, mem, dst, at, src) {
> 907:     assert(mask->bottom_type()->isa_vectmask(), "sanity");

Why the assert was added before? And why you can remove it now?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3923:

> 3921:     // End of low-level memory operations.
> 3922: 
> 3923:     @ForceInline

Why this change? Was it missing before or you found that based on testing?
What is criteria to add `@ForceInline`?

-------------

PR: https://git.openjdk.org/jdk/pull/9324