RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7]

Thu Apr 10 14:42:35 UTC 2025

On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Code rearrange, some renaming, fixing comments
>  - Changes suggested by Andrew Dinn.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5300:

> 5298:     // level 5
> 5299:     vs_ldpq(vq, kyberConsts);
> 5300:     int offsets4[4] = { 0, 32, 64, 96 };

Again, a comment:
// At level 5 related coefficients occur in discrete blocks of size 8 so
// need to be loaded interleaved using an ld2 operation with arrangement 2D
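
To make the intent of that comment concrete, here is a rough host-side model of what an ld2 structure load does with its two destination registers. It is only a sketch: `RegPair` and `ld2_model` are hypothetical names that do not exist in the stub generator, and I am reading the block sizes in these comments as byte counts (8 bytes per lane for arrangement 2D, 4 bytes per lane for 4S), which is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Two 128-bit vector registers, modelled as raw bytes (hypothetical type).
struct RegPair {
    std::array<std::uint8_t, 16> first{};
    std::array<std::uint8_t, 16> second{};
};

// Model of an ld2 structure load: reads 32 consecutive bytes from src and
// sends alternate lanes of laneBytes each to the first and second register.
// laneBytes == 8 models arrangement 2D, laneBytes == 4 models 4S.
inline RegPair ld2_model(const std::uint8_t* src, std::size_t laneBytes) {
    assert(laneBytes == 4 || laneBytes == 8);
    RegPair r;
    const std::size_t lanes = 16 / laneBytes;  // lanes per 128-bit register
    for (std::size_t i = 0; i < lanes; i++) {
        // even-numbered memory lanes go to the first register
        std::memcpy(r.first.data() + i * laneBytes,
                    src + (2 * i) * laneBytes, laneBytes);
        // odd-numbered memory lanes go to the second register
        std::memcpy(r.second.data() + i * laneBytes,
                    src + (2 * i + 1) * laneBytes, laneBytes);
    }
    return r;
}
```

With byte i of the source buffer set to i, the 2D case leaves bytes 0..7 and 16..23 in the first register and bytes 8..15 and 24..31 in the second, so each register ends up holding every other block.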

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319:

> 5317:     vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4);
> 5318: 
> 5319:     // level 6

And again
// At level 6 related coefficients occur in discrete blocks of size 4 so
// need to be loaded interleaved using an ld2 operation with arrangement 4S

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377:

> 5375:     // level 0
> 5376:     vs_ldpq(vq, kyberConsts);
> 5377:     int offsets4[4] = { 0, 32, 64, 96 };

Again, a comment:
// At level 0 related coefficients occur in discrete blocks of size 4 so
// need to be loaded interleaved using an ld2 operation with arrangement 4S

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399:

> 5397:     vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4);
> 5398: 
> 5399:     // level 1

Again, a comment:
// At level 1 related coefficients occur in discrete blocks of size 8 so
// need to be loaded interleaved using an ld2 operation with arrangement 2D

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5423:

> 5421: 
> 5422:     // level 2
> 5423:     int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 };

Again
// At level 2 coefficients occur in 8 discrete blocks of size 16
// so they are loaded using an ldr at 8 distinct offsets.
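
As an illustration of the indexed load that comment describes, the following is a minimal host-side sketch. `QReg` and `ldr_indexed_model` are hypothetical names for illustration only; the sketch assumes each ldr with a Q register moves 16 bytes and that the source address is base + start + offsets[i], which is my reading of the `vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3)` call rather than something taken from the helper's definition.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One 128-bit Q register, modelled as raw bytes (hypothetical type).
using QReg = std::array<std::uint8_t, 16>;

// Hypothetical model of an indexed ldr sequence: register i is filled with
// the 16 bytes found at base + start + offsets[i].
inline std::array<QReg, 8> ldr_indexed_model(const std::uint8_t* base,
                                             std::size_t start,
                                             const int (&offsets)[8]) {
    assert(base != nullptr);
    std::array<QReg, 8> regs{};
    for (std::size_t i = 0; i < 8; i++) {
        std::memcpy(regs[i].data(), base + start + offsets[i], 16);
    }
    return regs;
}
```

Using the offsets from the patch, `{ 0, 32, 64, 96, 128, 160, 192, 224 }`, a start of 0 picks up the 16-byte block at the beginning of each 32-byte stride and a start of 16 picks up the block that follows it.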

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464:

> 5462:     vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3);
> 5463: 
> 5464:     // level 3

// From level 3 upwards coefficients occur in discrete blocks whose size is
// some multiple of 32 so can be loaded using ldpq and suitable indexes.
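
Taken together, the suggested comments amount to a mapping from block size to load strategy. A small sketch of that mapping, with the caveat that `LoadKind` and `load_kind_for_block` are hypothetical names and that treating the block sizes as byte counts is my assumption:

```cpp
#include <cassert>
#include <cstddef>

// Load strategies named in the suggested comments (hypothetical enum).
enum class LoadKind { Ld2_4S, Ld2_2D, LdrQIndexed, Ldpq };

// Pick the load strategy for a given block size, following the comments:
// 4 -> ld2 with 4S, 8 -> ld2 with 2D, 16 -> ldr at distinct offsets,
// any multiple of 32 -> ldpq with suitable indexes.
inline LoadKind load_kind_for_block(std::size_t blockBytes) {
    switch (blockBytes) {
    case 4:
        return LoadKind::Ld2_4S;
    case 8:
        return LoadKind::Ld2_2D;
    case 16:
        return LoadKind::LdrQIndexed;
    default:
        assert(blockBytes > 0 && blockBytes % 32 == 0);
        return LoadKind::Ldpq;
    }
}
```

The point of the sketch is only that the choice of instruction follows mechanically from the block size at each level, which is what the comments are meant to tell the reader.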

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037571231
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037573218
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037577265
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037578385
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037581149
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037585101