RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v13]

Jatin Bhateja jbhateja at openjdk.org
Tue Feb 24 10:10:22 UTC 2026


On Mon, 9 Feb 2026 21:36:16 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Patch optimizes the Vector.slice operation with a constant index using the x86 VPALIGNR instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification of the fast path for selected vector APIs, or else enable inlining of the fallback implementation if it is based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bimorphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and is called during the incremental inlining optimization. It also relieves the inline expander of handling slow paths, which can easily be implemented on the library side (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:
> 
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Update callGenerator.hpp copyright year
>  - Review comments resolution
>  - Cleanups
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/ffb6279c...1dfff558

> I’m fine with using the more straightforward approach to intrinsify the slice API when the origin is a constant. In my view, this could also benefit other APIs and future optimizations (for example, #28520), since slice is a general vector operation. Relying on pattern matching makes the compiler implementation significantly more complex in my opinion.
> 
> Regarding inlining of the fallback implementation, I think we do need such a mechanism to handle APIs that fail to inline on the first attempt, given that the current fallback overhead is much heavier and leads to worse performance. And I agree with @merykitty that a more generic solution would be preferable.

Hi @merykitty, @XiaohongGong, based on the feedback received, I have modified the patch so that it no longer inlines on the first intrinsification failure. Instead, I now collect such CallGenerators and inline-expand the fallback implementation only towards the end of incremental inlining, along the lines of _string_late_inlines.
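
To illustrate the ordering described above, here is a minimal sketch in plain Java. It is only a toy model and not the HotSpot C++ code: the names LateInlineModel, CallSite, firstAttempt and finish are hypothetical, and "intrinsification succeeds" is reduced to a single boolean saying whether the origin argument is currently known to be a constant.

```java
// Toy model of deferred fallback inlining; all names here are hypothetical.
final class LateInlineModel {
    // Stand-in for a call site whose intrinsic needs a constant origin.
    static final class CallSite {
        final String name;
        boolean originIsConstant;   // may become true as other inlining exposes constants
        String outcome = "pending";

        CallSite(String name, boolean originIsConstant) {
            this.name = name;
            this.originIsConstant = originIsConstant;
        }
    }

    // Call sites whose first intrinsification attempt failed.
    private final java.util.List<CallSite> deferred = new java.util.ArrayList<>();

    // First attempt: intrinsify if possible, otherwise defer (do not inline yet).
    void firstAttempt(CallSite cs) {
        if (cs.originIsConstant) {
            cs.outcome = "intrinsified";
        } else {
            deferred.add(cs);
        }
    }

    // Towards the end of incremental inlining: retry each deferred site once,
    // and inline the fallback implementation only if the retry still fails.
    void finish() {
        for (CallSite cs : deferred) {
            cs.outcome = cs.originIsConstant ? "intrinsified" : "fallback-inlined";
        }
        deferred.clear();
    }
}
```

In this model, a site whose origin becomes constant between firstAttempt and finish ends up intrinsified rather than taking the fallback path, which is the point of deferring the decision.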

This gives an opportunity to create a constant context for VectorSlice intrinsification; if that fails, we inline the fallback implementation to avoid costly boxing penalties.
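
For readers less familiar with the operation, the lane semantics of a two-vector slice can be modelled with plain arrays. This is a scalar sketch only, assuming the usual definition (result lane i is lane origin + i of the first vector followed by the second); SliceModel is a hypothetical name and the code is neither the JDK fallback nor the intrinsic.

```java
// Scalar model of a two-vector slice; not the JDK implementation.
final class SliceModel {
    // Lane i of the result is lane (origin + i) of the concatenation a ++ b.
    static byte[] slice(byte[] a, byte[] b, int origin) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int vlen = a.length;
        if (origin < 0 || origin > vlen) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        byte[] r = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = origin + i;
            // Lanes past the end of a continue into b.
            r[i] = j < vlen ? a[j] : b[j - vlen];
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3, 4, 5, 6, 7};
        byte[] b = {8, 9, 10, 11, 12, 13, 14, 15};
        System.out.println(java.util.Arrays.toString(slice(a, b, 3)));
        // prints [3, 4, 5, 6, 7, 8, 9, 10]
    }
}
```

When origin is a compile-time constant, this concatenate-and-shift shape is fixed up front, which is what makes a single alignment instruction a candidate for the fast path.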

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3950522542


More information about the core-libs-dev mailing list