Integrated: 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

Jamil Nimeh jnimeh at openjdk.org
Tue Apr 22 16:52:49 UTC 2025


On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

> This fix addresses a performance regression seen on some aarch64 processors, namely the Apple M1, after we moved to a quarter-round-parallel implementation in JDK-8349106.  After improving the ordering of the instructions in the 20-round loop, we found that going back to a block-parallel implementation was faster, but only with those ordering changes in place.  More importantly, the block-parallel implementation with the interleaving turns out to be faster even on those processors that showed improvements when moving to the quarter-round-parallel implementation.
> 
> There is a spreadsheet attached to the JBS bug that compares 3 different implementations against the current (QR-parallel with no interleaving) implementation on 3 different ARM64 processors.  Comparative benchmarks can also be found below.
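
For readers unfamiliar with the terms in the description above, the following is a rough sketch of what a block-parallel data layout looks like, written in plain Java rather than aarch64 vector assembly. It is not the HotSpot intrinsic: the class name ChaChaLayoutSketch, the LANES constant, and the *Lanes methods are hypothetical, and the choice of four blocks per pass is an assumption made for illustration. The scalar quarter round follows RFC 8439, section 2.1. In the lane-wise version each inner loop stands in for one vector instruction that applies the same operation to the same state word of several blocks. The instruction interleaving discussed in the description is a scheduling matter and is not modeled here.

```java
// Illustrative sketch only; class and method names are hypothetical.
final class ChaChaLayoutSketch {
    static final int LANES = 4; // assumed number of blocks handled together

    // Scalar ChaCha20 quarter round on state words a, b, c, d (RFC 8439, section 2.1).
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // One double round (column round, then diagonal round) on a single 16-word block.
    static void doubleRound(int[] s) {
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    // Block-parallel layout: v[w][lane] is state word w of block `lane`.
    // Each inner loop models one vector operation across all LANES blocks.
    static void quarterRoundLanes(int[][] v, int a, int b, int c, int d) {
        for (int l = 0; l < LANES; l++) v[a][l] += v[b][l];
        for (int l = 0; l < LANES; l++) v[d][l] = Integer.rotateLeft(v[d][l] ^ v[a][l], 16);
        for (int l = 0; l < LANES; l++) v[c][l] += v[d][l];
        for (int l = 0; l < LANES; l++) v[b][l] = Integer.rotateLeft(v[b][l] ^ v[c][l], 12);
        for (int l = 0; l < LANES; l++) v[a][l] += v[b][l];
        for (int l = 0; l < LANES; l++) v[d][l] = Integer.rotateLeft(v[d][l] ^ v[a][l], 8);
        for (int l = 0; l < LANES; l++) v[c][l] += v[d][l];
        for (int l = 0; l < LANES; l++) v[b][l] = Integer.rotateLeft(v[b][l] ^ v[c][l], 7);
    }

    // Same double round as above, applied to LANES blocks at once.
    // No word shuffling is needed between the column and diagonal rounds,
    // because every state word lives in its own "register" v[w].
    static void doubleRoundLanes(int[][] v) {
        quarterRoundLanes(v, 0, 4, 8, 12);
        quarterRoundLanes(v, 1, 5, 9, 13);
        quarterRoundLanes(v, 2, 6, 10, 14);
        quarterRoundLanes(v, 3, 7, 11, 15);
        quarterRoundLanes(v, 0, 5, 10, 15);
        quarterRoundLanes(v, 1, 6, 11, 12);
        quarterRoundLanes(v, 2, 7, 8, 13);
        quarterRoundLanes(v, 3, 4, 9, 14);
    }
}
```

Calling doubleRoundLanes ten times on a transposed set of blocks gives the same words as calling doubleRound ten times on each block separately; only the data layout differs. The sketch covers the round function alone and omits key setup, the final addition of the input state, and Poly1305.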

This pull request has now been integrated.

Changeset: 594b2651
Author:    Jamil Nimeh <jnimeh at openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/594b26516e5c01d7daa331db59bdbe8ab7dc0a6d
Stats:     395 lines in 3 files changed: 137 ins; 80 del; 178 mod

8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

Reviewed-by: aph

-------------

PR: https://git.openjdk.org/jdk/pull/24420

