RFR: 8350748: VectorAPI: Method "checkMaskFromIndexSize" should be force inlined

Thu Feb 27 23:33:04 UTC 2025

On Thu, 27 Feb 2025 06:43:19 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> Method `checkMaskFromIndexSize` is called by several masked vector APIs such as `fromArray/intoArray/fromMemorySegment/...`. It checks whether the index of any active lane in a mask falls outside the bounds of the given array or MemorySegment. This method should be force-inlined; otherwise, whenever the C2 compiler does not inline the call, a VectorMask object is generated, which hurts API performance significantly.
> 
> This patch calls `VectorMask.checkFromIndexSize` directly inside these APIs instead of `checkMaskFromIndexSize`. Because that method already carries the `@ForceInline` annotation, C2 inlines and intrinsifies it, so the expected vector instructions are generated. With this change, the now-unused `checkMaskFromIndexSize` can be removed.
> 
> Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace CPU (AArch64 SVE2, 128-bit vectors). A similar improvement can be observed on an Intel CPU that supports AVX512.
> 
> The following is the performance data on Grace:
> 
> 
> Benchmark                                             Mode  Cnt  Units     Before      After   Gain
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      thrpt   30  ops/ms  31544.304  31610.598  1.002
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    thrpt   30  ops/ms   3896.202   3903.249  1.001
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     thrpt   30  ops/ms    570.415   7174.320  12.57
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       thrpt   30  ops/ms    566.694   7193.520  12.69
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      thrpt   30  ops/ms   3899.269   3878.258  0.994
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     thrpt   30  ops/ms   1134.301  16053.847  14.15
> StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE    thrpt   30  ops/ms  26449.558  28699.480  1.085
> StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE  thrpt   30  ops/ms   1922.167   5781.077  3.007
> StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE   thrpt   30  ops/ms   3784.190  11789.276  3.115
> StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE     thrpt   30  ops/ms   3694.082  15633.547  4.232
> StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE    thrpt   30  ops/ms   1966.956   6049.790  3.075
> StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE   thrpt   30  ops/ms   7647.309  27412.387  3.584
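
For readers unfamiliar with the check discussed above, here is a minimal scalar sketch of what "no active lane reaches out of bounds" means. It is only an illustrative model, not the JDK implementation: the class `MaskBoundsSketch`, the methods `firstOutOfBoundsLane` and `checkActiveLanes`, and the use of a `boolean[]` in place of a `VectorMask` are all hypothetical names and simplifications introduced here.

```java
// Scalar model of a masked bounds check. Assumption: a boolean[] stands in
// for a vector mask, and lane i of the vector maps to index offset + i.
public class MaskBoundsSketch {

    /**
     * Returns the first active lane whose index (offset + lane) falls outside
     * [0, length), or -1 if every active lane is in bounds.
     */
    public static int firstOutOfBoundsLane(boolean[] mask, int offset, int length) {
        for (int lane = 0; lane < mask.length; lane++) {
            long index = (long) offset + lane; // widen to avoid int overflow
            if (mask[lane] && (index < 0 || index >= length)) {
                return lane;
            }
        }
        return -1;
    }

    /** Throws if any active lane would access an out-of-bounds index. */
    public static void checkActiveLanes(boolean[] mask, int offset, int length) {
        int lane = firstOutOfBoundsLane(mask, offset, length);
        if (lane >= 0) {
            throw new IndexOutOfBoundsException(
                "active lane " + lane + " maps to index " + ((long) offset + lane)
                + ", outside [0, " + length + ")");
        }
    }

    public static void main(String[] args) {
        // Lanes 0 and 1 are active and map to indices 6 and 7 of an
        // 8-element array; inactive lanes 2 and 3 are ignored, so this passes.
        checkActiveLanes(new boolean[] {true, true, false, false}, 6, 8);

        // Lane 2 is active and maps to index 8, which is out of bounds.
        try {
            checkActiveLanes(new boolean[] {true, true, true, false}, 6, 8);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the model is that inactive lanes may extend past the end of the array without error; only active lanes are checked. The real check operates on vector masks and, per the description above, relies on being inlined and intrinsified by C2 rather than looping over lanes.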

Marked as reviewed by psandoz (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/23817#pullrequestreview-2649322190