RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets.

Jatin Bhateja jbhateja at openjdk.org
Mon Jul 4 15:46:55 UTC 2022


On Thu, 30 Jun 2022 02:03:47 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Hi All,
>> 
>> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added support for masked loads on non-predicated targets by blending the loaded contents with a zero vector, iff the unmasked portion of the load does not span beyond the array bounds.
>> 
>> X86 AVX2 offers direct predicated vector load/store instructions for non-sub-word types.
>> 
>> This patch adds an efficient backend implementation for predicated memory operations over int/long/float/double vectors.
>> 
>> Please find below the JMH micro stats with and without patch.
>> 
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]
>> 
>> Baseline:
>> Benchmark                                          (inSize)  (outSize)   Mode  Cnt    Score   Error   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2  712.218          ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  156.912          ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  255.814          ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  267.688          ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  140.957          ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2  474.009          ops/ms
>> 
>> 
>> With Opt:
>> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   742.781          ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  1241.021          ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  2333.311          ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  3258.754          ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192          ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590          ops/ms
>> 
>> 
>> Predicated memory operations over sub-word types will be handled in a subsequent patch.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> src/hotspot/cpu/x86/x86.ad line 1762:
> 
>> 1760:       break;
>> 1761:     case Op_LoadVectorMasked:
>> 1762:       if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) {
> 
> With `UseAVX=0` we clear `supports_avx512bw`. So the test should be 
> 
> if (!VM_Version::supports_avx512bw() && is_subword_type(bt) || UseAVX < 1)
> 
> 
> And maybe a naive question: is VectorMaskGen used for `mask` node creation? If so, why have separate support checks for `LoadVectorMasked/StoreVectorMasked`?

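Hi Vladimir, the existing expression benefits from short-circuiting: when AVX512BW is supported, `!VM_Version::supports_avx512bw()` is false and the rest of the condition is skipped. With the suggested form we would need to evaluate two expressions in the truly supported case.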
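
To make the short-circuit point concrete, here is a minimal Java sketch. It is not HotSpot code: `probe`, `existingUnsupported`, `suggestedUnsupported` and the other names are hypothetical stand-ins for the shape of the two conditions. It counts operand evaluations and also checks that both forms agree on every state allowed by the assumption stated in the review comment (that `UseAVX=0` clears `supports_avx512bw`).

```java
public class ShortCircuitSketch {
    // Number of operands evaluated by the most recent check.
    static int evaluations = 0;

    // Wraps an operand so that each evaluation is counted.
    static boolean probe(boolean value) {
        evaluations++;
        return value;
    }

    // Shape of the existing check: !avx512bw && (subword || useAVX < 1)
    static boolean existingUnsupported(boolean avx512bw, boolean subword, int useAVX) {
        return !probe(avx512bw) && (probe(subword) || probe(useAVX < 1));
    }

    // Shape of the suggested check: (!avx512bw && subword) || useAVX < 1
    static boolean suggestedUnsupported(boolean avx512bw, boolean subword, int useAVX) {
        return (!probe(avx512bw) && probe(subword)) || probe(useAVX < 1);
    }

    static int costOfExisting(boolean avx512bw, boolean subword, int useAVX) {
        evaluations = 0;
        existingUnsupported(avx512bw, subword, useAVX);
        return evaluations;
    }

    static int costOfSuggested(boolean avx512bw, boolean subword, int useAVX) {
        evaluations = 0;
        suggestedUnsupported(avx512bw, subword, useAVX);
        return evaluations;
    }

    // Assumption taken from the thread: UseAVX=0 clears avx512bw,
    // so the state (avx512bw && useAVX < 1) is skipped as unreachable.
    static boolean agreeOnConsistentStates() {
        boolean[] flags = {false, true};
        for (int useAVX = 0; useAVX <= 3; useAVX++) {
            for (boolean avx512bw : flags) {
                for (boolean subword : flags) {
                    if (avx512bw && useAVX < 1) {
                        continue;
                    }
                    if (existingUnsupported(avx512bw, subword, useAVX)
                            != suggestedUnsupported(avx512bw, subword, useAVX)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Fully supported case: AVX512BW present, non-sub-word type, UseAVX=3.
        System.out.println("existing:  " + costOfExisting(true, false, 3));  // prints 1
        System.out.println("suggested: " + costOfSuggested(true, false, 3)); // prints 2
        System.out.println("same result on consistent states: " + agreeOnConsistentStates());
    }
}
```

Under that assumption the two forms return the same answer, so the difference is only in how many operands are evaluated on the supported path.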

As of now, VectorMaskGen is used on AVX3 targets, specifically for partial inlining of copy and vectorized compare, and for post vector loop processing. Partial inlining of copy is only enabled for sub-word types, and for AVX2 we do not have sub-word handling yet. For partial inlining of vectorized compare we use an explicit threshold, i.e. array length >= 16, so only sub-word types and 512-bit integer species meet this threshold.
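
One way to read the threshold remark is as lane-count arithmetic: a vector holds `vectorBits / elementBits` elements, and a species qualifies only if it holds at least 16. The Java sketch below works through that reading. It is an illustration under that assumption, not the actual C2 check, and the names `lanes` and `meetsThreshold` are hypothetical.

```java
public class LaneCountSketch {
    // Number of elements that fit in one vector of the given width.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // Assumed reading of the threshold: the vector must hold at least `threshold` elements.
    static boolean meetsThreshold(int vectorBits, int elementBits, int threshold) {
        return lanes(vectorBits, elementBits) >= threshold;
    }

    public static void main(String[] args) {
        String[] names = {"byte", "short", "int", "long"};
        int[] elementBits = {8, 16, 32, 64};
        int[] vectorBits = {256, 512}; // 256 = AVX2 width, 512 = AVX-512 width
        for (int width : vectorBits) {
            for (int i = 0; i < names.length; i++) {
                System.out.println(width + "-bit " + names[i] + ": "
                        + lanes(width, elementBits[i]) + " lanes, qualifies="
                        + meetsThreshold(width, elementBits[i], 16));
            }
        }
    }
}
```

With 256-bit vectors only byte (32 lanes) and short (16 lanes) reach 16 elements. int reaches it only at 512 bits, and long (8 lanes at 512 bits) never does.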

I will post a subsequent patch with sub-word handling for masked loads/stores over AVX2 after some performance analysis, along with maskgen patterns for AVX2.

-------------

PR: https://git.openjdk.org/jdk/pull/9324


More information about the hotspot-compiler-dev mailing list