RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2]

Thu Jul 7 19:01:47 UTC 2022

On Wed, 6 Jul 2022 13:13:08 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3923:
>> 
>>> 3921:     // End of low-level memory operations.
>>> 3922: 
>>> 3923:     @ForceInline
>> 
>> Why this change? Was it missing before, or did you find it based on testing?
>> What are the criteria for adding `@ForceInline`?
>
> Thanks for highlighting this. checkMaskFromIndexSize is used to handle illegal memory access cases with out-of-range offsets, i.e. tail scenarios, so the profile-based invocation count will always be low for these calls. I saw improved performance on targeted micros, but I get your point that aggressively forcing inlining on infrequently taken paths can have adverse performance side effects. Backing off from it may give up some of the performance gains from masked load/store support on tail paths, but we still see a modest gain on the order of 2-3x over baseline for non-subword types, versus the original 10x.
> 
> 
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt    Score   Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2  748.793          ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  381.655          ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  741.809          ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  757.433          ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  386.450          ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2  471.260          ops/ms

okay.
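
For reference, the tail scenario discussed above can be modeled in plain scalar Java. This is only a sketch under stated assumptions, not the Vector API implementation: `LANES`, `inRangeMask` and `maskedLoad` are hypothetical names chosen for illustration, and the 8-lane width assumes 32-bit elements in a 256-bit AVX2 register. The array length of 1026 matches the `inSize` used in the benchmark, which leaves 2 elements past the last full vector.

```java
import java.util.Arrays;

public class TailMaskSketch {
    // Hypothetical lane count: 8 int lanes, as in a 256-bit vector.
    static final int LANES = 8;

    // Lane i is enabled only when offset + i is a valid index
    // into an array of the given length.
    static boolean[] inRangeMask(int offset, int length) {
        boolean[] mask = new boolean[LANES];
        for (int i = 0; i < LANES; i++) {
            int idx = offset + i;
            mask[i] = idx >= 0 && idx < length;
        }
        return mask;
    }

    // Scalar model of a masked load: disabled lanes stay zero
    // and the out-of-range elements are never read.
    static int[] maskedLoad(int[] src, int offset, boolean[] mask) {
        int[] lanes = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            if (mask[i]) {
                lanes[i] = src[offset + i];
            }
        }
        return lanes;
    }

    public static void main(String[] args) {
        int[] src = new int[1026];
        Arrays.fill(src, 7);
        int offset = 1024; // last vector-aligned offset; 2 elements remain
        boolean[] mask = inRangeMask(offset, src.length);
        System.out.println(Arrays.toString(maskedLoad(src, offset, mask)));
        // prints [7, 7, 0, 0, 0, 0, 0, 0]
    }
}
```

Because this path runs only once per array, at the tail, its invocation count stays low in the profile, which is the reason the inlining question came up.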

-------------

PR: https://git.openjdk.org/jdk/pull/9324