RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets.

Thu Jun 30 02:07:44 UTC 2022

On Wed, 29 Jun 2022 09:07:48 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> Hi All,
> 
> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added support for handling masked loads on non-predicated targets by blending the loaded contents with a zero vector iff the unmasked portion of the load does not span beyond the array bounds.
> 
> X86 AVX2 offers direct predicated vector load/store instructions for non-subword types.
> 
> This patch adds an efficient backend implementation for predicated memory operations over int/long/float/double vectors.
> 
> Please find below the JMH microbenchmark results with and without the patch.
> 
> 
> 
> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]
> 
> Baseline:
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt    Score   Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2  712.218          ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  156.912          ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  255.814          ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  267.688          ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  140.957          ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2  474.009          ops/ms
> 
> 
> With Opt:
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   742.781          ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  1241.021          ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  2333.311          ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  3258.754          ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192          ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590          ops/ms
> 
> 
> Predicated memory operations over subword types will be handled in a subsequent patch.
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin
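
To make the two load strategies in the quoted description concrete, here is a minimal scalar sketch in C++. It is only a model of the semantics as described above, not HotSpot or Vector API code: `tail_mask`, `blend_load`, and `predicated_load` are hypothetical names introduced for illustration, and the element type is fixed to `int` for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model; none of these names come from HotSpot or the Vector API.

// Lane i is active iff offset + i is a valid index into an array of `length` elements.
template <std::size_t Lanes>
std::array<bool, Lanes> tail_mask(std::size_t offset, std::size_t length) {
    std::array<bool, Lanes> mask{};
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        mask[lane] = offset + lane < length;
    }
    return mask;
}

// Blend-style fallback: read every lane of the vector, then replace inactive lanes with zero.
// Returns false (leaving `out` untouched) when the full read would cross the end of `src`.
template <std::size_t Lanes>
bool blend_load(const std::vector<int>& src, std::size_t offset,
                const std::array<bool, Lanes>& mask, std::array<int, Lanes>& out) {
    if (offset + Lanes > src.size()) {
        return false;  // the unmasked load would span beyond the array bounds
    }
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        const int loaded = src[offset + lane];  // full, unmasked read
        out[lane] = mask[lane] ? loaded : 0;    // blend with the zero vector
    }
    return true;
}

// Predicated load: inactive lanes are never read from memory and stay zero.
template <std::size_t Lanes>
std::array<int, Lanes> predicated_load(const std::vector<int>& src, std::size_t offset,
                                       const std::array<bool, Lanes>& mask) {
    std::array<int, Lanes> result{};  // value-initialized, so every lane starts at zero
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        if (mask[lane]) {
            assert(offset + lane < src.size() && "active lanes must be in bounds");
            result[lane] = src[offset + lane];
        }
    }
    return result;
}
```

In this model, a loop tail such as offset 8 in a 10-element array with 8 lanes cannot use `blend_load` (it returns `false`), while `predicated_load` with `tail_mask<8>(8, 10)` touches only the two valid elements.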

src/hotspot/cpu/x86/x86.ad line 1762:

> 1760:       break;
> 1761:     case Op_LoadVectorMasked:
> 1762:       if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) {

With `UseAVX=0` we clear `supports_avx512bw`. So the test should be:

if (!VM_Version::supports_avx512bw() && is_subword_type(bt) || UseAVX < 1)

This may be a naive question: is `VectorMaskGen` used to create the `mask` node? If so, why have separate support checks for `LoadVectorMasked`/`StoreVectorMasked`?
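
As a quick sanity check on the two forms of the condition, the following C++ sketch enumerates a small model of the relevant state. It is not HotSpot code: `CpuModel` and the function names are hypothetical, the `UseAVX` range of 0 to 3 is an assumption, and the consistency rule simply encodes the premise stated above (that `UseAVX=0` clears `supports_avx512bw`).

```cpp
#include <cassert>

// Hypothetical stand-ins for the VM state consulted by the check; not HotSpot code.
struct CpuModel {
    int  use_avx;   // models UseAVX (assumed range 0..3)
    bool avx512bw;  // models VM_Version::supports_avx512bw()
};

// Condition as quoted from x86.ad line 1762.
constexpr bool rejects_original(CpuModel c, bool subword) {
    return !c.avx512bw && (subword || c.use_avx < 1);
}

// Condition as proposed in the review; && binds tighter than ||.
constexpr bool rejects_proposed(CpuModel c, bool subword) {
    return (!c.avx512bw && subword) || c.use_avx < 1;
}

// Premise from the review: with UseAVX=0, supports_avx512bw is cleared.
constexpr bool is_consistent(CpuModel c) {
    return c.use_avx >= 1 || !c.avx512bw;
}

// Counts the consistent states on which the two conditions disagree.
constexpr int count_disagreements() {
    int n = 0;
    for (int avx = 0; avx <= 3; ++avx) {
        for (int bw = 0; bw <= 1; ++bw) {
            for (int sub = 0; sub <= 1; ++sub) {
                const CpuModel c{avx, bw != 0};
                if (is_consistent(c) &&
                    rejects_original(c, sub != 0) != rejects_proposed(c, sub != 0)) {
                    ++n;
                }
            }
        }
    }
    return n;
}

static_assert(count_disagreements() == 0,
              "both forms agree whenever UseAVX=0 implies !supports_avx512bw");
```

In this model the two forms differ only when `avx512bw` is true while `use_avx` is 0, which the stated premise rules out. The proposed form makes the `UseAVX < 1` rejection explicit rather than relying on that invariant.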

src/hotspot/share/opto/vectorIntrinsics.cpp line 313:

> 311:       return true;
> 312:     }
> 313: 

Why is this placed here without an `is_supported` check? The comment does not explain it.

-------------

PR: https://git.openjdk.org/jdk/pull/9324