RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2]

Mon Jul 28 05:55:55 UTC 2025

On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Patch optimizes the Vector.slice operation with a constant index using the x86 ALIGNR instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification of the fast path for selected vector APIs, or else enable inlining of the fallback implementation if it is based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and is called during incremental inlining optimization. It also relieves the inline expander of handling slow paths, which can easily be implemented on the library side (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Updating predicate checks
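
For readers less familiar with the operation being optimized, here is a minimal scalar sketch of what a two-vector slice with a constant origin computes: lane i of the result is lane (origin + i) of the two inputs laid end to end. The class name SliceModel and its method are hypothetical and exist only for illustration; this is not the code from the patch, and the bounds check is an assumption of the model.

```java
import java.util.Arrays;

// Hypothetical scalar model of a two-vector slice; not code from the patch.
public class SliceModel {

    // Returns the lanes of (a followed by b) starting at lane 'origin',
    // keeping exactly a.length lanes.
    static byte[] slice(byte[] a, byte[] b, int origin) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        int vlen = a.length;
        // Assumption of this model: origin must lie in [0, vlen].
        if (origin < 0 || origin > vlen) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        byte[] result = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = origin + i;
            // Lanes past the end of 'a' continue into 'b'.
            result[i] = (j < vlen) ? a[j] : b[j - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3, 4, 5, 6, 7};
        byte[] b = {10, 11, 12, 13, 14, 15, 16, 17};
        // prints [3, 4, 5, 6, 7, 10, 11, 12]
        System.out.println(Arrays.toString(slice(a, b, 3)));
    }
}
```

When the origin is a compile-time constant, this lane selection is a fixed shift of the concatenated inputs, which is the shape of work a byte-align instruction performs.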

Performance on an AVX512 machine:

Baseline:
Benchmark                                                (size)   Mode  Cnt      Score       Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    4  35741.780 ±  1561.065  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    4  35011.929 ±  5886.902  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    4  32366.844 ±  1489.449  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    4  10636.281 ±   608.705  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    4  10750.833 ±   328.997  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    4  10257.338 ±  2027.422  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    4   5362.330 ±  4199.651  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    4   4992.399 ±  6053.641  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    4   4941.258 ±   478.193  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    4  40432.828 ± 26672.673  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    4  41300.811 ± 34342.482  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    4  36958.309 ±  1899.676  ops/ms

With opt:
Benchmark                                                (size)   Mode  Cnt      Score       Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt   10  67936.711 ±   389.783  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt   10  70086.731 ±  5972.968  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt   10  31879.187 ±   148.213  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt   10  17676.883 ±   217.238  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt   10  16983.007 ±  3988.548  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt   10   9851.266 ±    31.773  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt   10   9194.216 ±    42.772  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt   10   8411.738 ±    33.209  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt   10   5244.850 ±    12.214  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt   10  61233.526 ± 20472.895  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt   10  61545.276 ± 20722.066  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt   10  41208.718 ±  5374.829  ops/ms
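
To make the two tables easier to compare, the following sketch computes the throughput ratio (with opt / baseline) for each benchmark from the scores listed above. The class name SliceSpeedup and its helper are hypothetical. The ratios should be read as rough indications only, since the baseline run uses Cnt 4 and several rows have wide error bars.

```java
// Hypothetical helper that derives speedup ratios from the scores above.
public class SliceSpeedup {

    static final String[] NAMES = {
        "byteVectorSliceWithConstantIndex1", "byteVectorSliceWithConstantIndex2",
        "byteVectorSliceWithVariableIndex",
        "intVectorSliceWithConstantIndex1", "intVectorSliceWithConstantIndex2",
        "intVectorSliceWithVariableIndex",
        "longVectorSliceWithConstantIndex1", "longVectorSliceWithConstantIndex2",
        "longVectorSliceWithVariableIndex",
        "shortVectorSliceWithConstantIndex1", "shortVectorSliceWithConstantIndex2",
        "shortVectorSliceWithVariableIndex"
    };

    // Scores in ops/ms, copied from the "Baseline" table.
    static final double[] BASELINE = {
        35741.780, 35011.929, 32366.844,
        10636.281, 10750.833, 10257.338,
        5362.330, 4992.399, 4941.258,
        40432.828, 41300.811, 36958.309
    };

    // Scores in ops/ms, copied from the "With opt" table.
    static final double[] WITH_OPT = {
        67936.711, 70086.731, 31879.187,
        17676.883, 16983.007, 9851.266,
        9194.216, 8411.738, 5244.850,
        61233.526, 61545.276, 41208.718
    };

    // Throughput ratio: values above 1.0 mean the optimized build is faster.
    static double speedup(double baseline, double withOpt) {
        return withOpt / baseline;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-36s %.2fx%n", NAMES[i], speedup(BASELINE[i], WITH_OPT[i]));
        }
    }
}
```

The constant-index rows are the ones the patch targets; the variable-index rows act as a control and are expected to stay close to 1.0x.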

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3125629912