RFR: 8376164: Optimize AES/ECB/PKCS5Padding implementation using full-message intrinsic stub and parallel RoundKey addition [v2]

Thu Feb 19 03:03:03 UTC 2026

On Mon, 9 Feb 2026 21:27:55 GMT, xinyangwu <duke at openjdk.org> wrote:

>> ### Summary
>> This PR introduces a parallel intrinsic for AES/ECB operations to replace the current per-block processing approach, reducing native call overhead and improving throughput for multi-block operations.
>> ### Problem
>> Except on platforms supporting AVX512, the existing AES/ECB/PKCS5Padding implementation suffers from three major performance issues:
>> 1. Excessive stub call overhead: Each 16-byte block requires a separate intrinsic call, resulting in high invocation frequency
>> 
>> 2. Inefficient instruction-level parallelism: The serialized block processing fails to fully utilize instruction-level parallelism
>> 
>> 3. Redundant setup/teardown: Repeated initialization of encryption state for each block
>> ### Changes
>> Added parallel AES intrinsic implementation
>> ### Testing
>> JMH benchmarks
>> 
>> It brings about a **37.43%** performance improvement on the benchmark below (11518.846 / 8381.499 ≈ 1.3743).
>> 
>> On an Intel(R) Core(TM) i9-14900HX CPU machine with the original implementation:
>> 
>> 
>> Benchmark     Mode  Cnt      Score    Error  Units
>> AesTest.test  avgt    5  11518.846 ± 68.621  ns/op
>> 
>> 
>> On the same machine with the optimized implementation:
>> 
>> 
>> Benchmark     Mode  Cnt     Score    Error  Units
>> AesTest.test  avgt    5  8381.499 ± 57.751  ns/op
>> 
>> 
>> All Tier-1 tests pass on linux-x64. This modification does not involve changing the encryption or decryption logic.
>
> xinyangwu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
> 
>  - Merge branch 'openjdk:master' into aes
>  - 8376164: Optimize AES/ECB/PKCS5Padding with parallel intrinsic

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 1404:

> 1402: }
> 1403: 
> 1404: address StubGenerator::generate_electronicCodeBook_encryptAESCrypt_multiBlock_Parallel() {

It would be nice to have a method description (inputs, outputs, which registers are used on which platforms) and a summary of the overall algorithm used for multi-block encryption/decryption.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 1449:

> 1447:     __ opc(xmm_result2, reg); \
> 1448:     __ opc(xmm_result3, reg); \
> 1449:   } while (0)

Can't this just be:
#define DoFour(opc, reg)   \
__ opc(xmm_result0, reg); \
__ opc(xmm_result1, reg); \
__ opc(xmm_result2, reg); \
__ opc(xmm_result3, reg);
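
For context on the trade-off, here is a minimal sketch of how the two macro shapes differ when used under an unbraced `if`. It is an illustration under stated assumptions, not the PR's code: `FakeAssembler`, `DO_FOUR_WRAPPED`, `DO_FOUR_BARE`, and the `emitted_*` helpers are hypothetical names, and the assembler is replaced by a simple recorder in place of the `__` macro.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the macro assembler: it only records what
// would have been emitted, so the two macro shapes can be compared.
struct FakeAssembler {
  std::vector<std::string> log;
  void aesenc(const char* dst, const char* src) {
    log.push_back(std::string("aesenc ") + dst + ", " + src);
  }
};

// Variant A: wrapped in do { ... } while (0); expands to ONE statement.
#define DO_FOUR_WRAPPED(masm, opc, reg) \
  do {                                  \
    (masm).opc("xmm_result0", reg);     \
    (masm).opc("xmm_result1", reg);     \
    (masm).opc("xmm_result2", reg);     \
    (masm).opc("xmm_result3", reg);     \
  } while (0)

// Variant B: bare statement list; expands to FOUR separate statements.
#define DO_FOUR_BARE(masm, opc, reg) \
  (masm).opc("xmm_result0", reg);    \
  (masm).opc("xmm_result1", reg);    \
  (masm).opc("xmm_result2", reg);    \
  (masm).opc("xmm_result3", reg);

// Counts emitted instructions when variant A sits under an unbraced if.
inline std::size_t emitted_wrapped(bool enabled) {
  FakeAssembler masm;
  if (enabled)
    DO_FOUR_WRAPPED(masm, aesenc, "xmm_key");
  return masm.log.size();
}

// Same for variant B: only the first expanded statement is guarded by
// the if; the remaining three always execute.
inline std::size_t emitted_bare(bool enabled) {
  FakeAssembler masm;
  if (enabled)
    DO_FOUR_BARE(masm, aesenc, "xmm_key");
  return masm.log.size();
}
```

If the macros are only ever invoked as standalone statements, both shapes behave the same and the simpler form works; the wrapper matters only if a call ends up under an unbraced `if`/`else`.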

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 1451:

> 1449:   } while (0)
> 1450: 
> 1451: #define DoOne(opc, reg)         \

Same question as above.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2825449292
PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2825448699
PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2825448851