RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2]
Jatin Bhateja
jbhateja at openjdk.org
Mon Jul 28 05:55:55 UTC 2025
On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Patch optimizes the Vector.slice operation with a constant index using the x86 ALIGNR instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails.
>>
>> The idea here is to add infrastructure support to enable intrinsification of the fast path for selected vector APIs, or else enable inlining of the fall-back implementation if it is based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and is called during incremental inlining optimization. It also relieves the inline expander of handling slow paths, which can easily be implemented on the library side (Java).
>>
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>>
>> Performance numbers:
>>
>>
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>>
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score       Error  Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1      1024  thrpt    2   9444.444              ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2      1024  thrpt    2  10009.319              ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex       1024  thrpt    2   9081.926              ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1       1024  thrpt    2   6085.825              ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2       1024  thrpt    2   6505.378              ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex        1024  thrpt    2   6204.489              ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1      1024  thrpt    2   1651.334              ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2      1024  thrpt    2   1642.784              ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex       1024  thrpt    2   1474.808              ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1     1024  thrpt    2  10399.394              ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2     1024  thrpt    2  10502.894              ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Updating predicate checks
Performance on an AVX512 machine:
Baseline:
Benchmark                                                (size)   Mode  Cnt      Score       Error  Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1      1024  thrpt    4  35741.780 ±  1561.065  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2      1024  thrpt    4  35011.929 ±  5886.902  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex       1024  thrpt    4  32366.844 ±  1489.449  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1       1024  thrpt    4  10636.281 ±   608.705  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2       1024  thrpt    4  10750.833 ±   328.997  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex        1024  thrpt    4  10257.338 ±  2027.422  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1      1024  thrpt    4   5362.330 ±  4199.651  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2      1024  thrpt    4   4992.399 ±  6053.641  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex       1024  thrpt    4   4941.258 ±   478.193  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1     1024  thrpt    4  40432.828 ± 26672.673  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2     1024  thrpt    4  41300.811 ± 34342.482  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex      1024  thrpt    4  36958.309 ±  1899.676  ops/ms
With opt:
Benchmark                                                (size)   Mode  Cnt      Score       Error  Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1      1024  thrpt   10  67936.711 ±   389.783  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2      1024  thrpt   10  70086.731 ±  5972.968  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex       1024  thrpt   10  31879.187 ±   148.213  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1       1024  thrpt   10  17676.883 ±   217.238  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2       1024  thrpt   10  16983.007 ±  3988.548  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex        1024  thrpt   10   9851.266 ±    31.773  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1      1024  thrpt   10   9194.216 ±    42.772  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2      1024  thrpt   10   8411.738 ±    33.209  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex       1024  thrpt   10   5244.850 ±    12.214  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1     1024  thrpt   10  61233.526 ± 20472.895  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2     1024  thrpt   10  61545.276 ± 20722.066  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex      1024  thrpt   10  41208.718 ±  5374.829  ops/ms
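
For readers less familiar with the operation being benchmarked, here is a minimal scalar sketch of the lane semantics of a two-vector slice: the result is a window of adjacent lanes that starts at the origin lane of the first vector and continues into the second. The class name SliceModel and the use of plain int arrays are assumptions made for illustration only; this is not code from the patch and says nothing about how the intrinsic is lowered. It only shows why a "concatenate and extract a window" instruction fits the constant-index case.

```java
import java.util.Arrays;

// Hypothetical scalar model of v1.slice(origin, v2); not part of the patch.
final class SliceModel {

    // Returns the lanes [origin, origin + n) of the concatenation v1 ++ v2,
    // where n is the common lane count of both inputs.
    static int[] slice(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        if (v2.length != n) {
            throw new IllegalArgumentException("lane counts differ");
        }
        if (origin < 0 || origin > n) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            // Lanes before the end of v1 come from v1, the rest from v2.
            result[i] = (j < n) ? v1[j] : v2[j - n];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3};
        int[] b = {4, 5, 6, 7};
        // Window starting at lane 1 of a, spilling one lane into b.
        System.out.println(Arrays.toString(slice(a, b, 1))); // [1, 2, 3, 4]
    }
}
```

With a compile-time constant origin the window position is fixed, which is what allows an immediate-operand instruction to be selected; with a variable origin that choice is not available, consistent with the variable-index rows above staying close to baseline.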
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3125629912