RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7]

Mon Apr 14 13:00:45 UTC 2025

On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Code rearrange, some renaming, fixing comments
>  - Changes suggested by Andrew Dinn.

src/hotspot/cpu/aarch64/register_aarch64.hpp line 509:

> 507: }
> 508: 
> 509: // convenience methods for splitting 8-way of 4-way vector register

Suggestion:

// convenience methods for splitting 8-way or 4-way vector register

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5012:

> 5010:     assert(!va.is_constant(), "output vector must identify 2 different registers");
> 5011: 
> 5012:     // schedule 2 streams of i<nstructions across the vector sequences

Suggestion:

    // schedule 2 streams of instructions across the vector sequences

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5166:

> 5164:     //
> 5165:     // On each level, we fill up the vector registers in such a way that the
> 5166:     // array elements that need to be multiplied by the zetas be in one

Suggestion:

    // array elements that need to be multiplied by the zetas are in one

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5168:

> 5166:     // array elements that need to be multiplied by the zetas be in one
> 5167:     // set of vector registers while the corresponding ones that don't need to
> 5168:     // be multiplied, in another set. We can do 32 Montgomery multiplications

Suggestion:

    // be multiplied are in another set. We can do 32 Montgomery multiplications

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5278:

> 5276:     // level 4
> 5277:     vs_ldpq(vq, kyberConsts);
> 5278:     int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 };

I'd like to add comment here to explain the coefficient grouping
and likewise at level 5 and 6. So here we have:
Suggestion:

    // Up to level 3 the coefficients multiplied by or added/subtracted
    // to the zetas occur in discrete blocks whose size is some multiple
    // of 32. At level 4 coefficients occur in 8 discrete blocks of size 16
    // so they are loaded using an ldr at 8 distinct offsets.
    int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 };

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5299:

> 5297: 
> 5298:     // level 5
> 5299:     vs_ldpq(vq, kyberConsts);

Suggestion:

    vs_ldpq(vq, kyberConsts);
    // At level 5 related coefficients occur in discrete blocks of size 8 so
    // need to be loaded interleaved using an ld2 operation with arrangement 2D

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319:

> 5317:     vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4);
> 5318: 
> 5319:     // level 6

Suggestion:

    // level 6
    // At level 6 related coefficients occur in discrete blocks of size 4 so
    // need to be loaded interleaved using an ld2 operation with arrangement 4S

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377:

> 5375:     // level 0
> 5376:     vs_ldpq(vq, kyberConsts);
> 5377:     int offsets4[4] = { 0, 32, 64, 96 };

Again a comment
Suggestion:

     // At level 6 related coefficients occur in discrete blocks of size 4 so
     // need to be loaded interleaved using an ld2 operation with arrangement 4S
     int offsets4[4] = { 0, 32, 64, 96 };

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399:

> 5397:     vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4);
> 5398: 
> 5399:     // level 1

Again a comment
Suggestion:

    // level 1
    // At level 1 related coefficients occur in discrete blocks of size 8 so
    // need to be loaded interleaved using an ld2 operation with arrangement 2D

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5422:

> 5420:     vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4);
> 5421: 
> 5422:     // level 2

Again
Suggestion:

    // level 2
    // At level 2 coefficients occur in 8 discrete blocks of size 16
    // so they are loaded using employing an ldr at 8 distinct offsets.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464:

> 5462:     vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3);
> 5463: 
> 5464:     // level 3

Suggestion:

    // level 3
    // From level 3 upwards coefficients occur in discrete blocks whose size is
    // some multiple of 32 so can be loaded using ldpq and suitable indexes.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5591:

> 5589:     store64shorts(vs2, tmpAddr);
> 5590: 
> 5591:     load64shorts(vs1, tmpAddr);

I'd like to make explicit the fact that we have avoided doing an add here (and in the next two cases) by adding a commented out generation step i.e. at this line insert
Suggestion:

    // __ add(tmpAddr, coeffs, 128); // unneeded as implied by preceding store
    load64shorts(vs1, tmpAddr);

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5596:

> 5594:     store64shorts(vs2, tmpAddr);
> 5595: 
> 5596:     load64shorts(vs1, tmpAddr);

Likewise insert:
Suggestion:

    // __ add(tmpAddr, coeffs, 256); // unneeded as implied by preceding store
    load64shorts(vs1, tmpAddr);

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5601:

> 5599:     store64shorts(vs2, tmpAddr);
> 5600: 
> 5601:     load64shorts(vs1, tmpAddr);

Likewise insert:
Suggestion:

    // __ add(tmpAddr, coeffs, 384); // unneeded as implied by preceding store
    load64shorts(vs1, tmpAddr);

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5640:

> 5638:     VSeq<4> vs1(0), vs2(4);  // 4 sets of 8x8H inputs/outputs/tmps
> 5639:     VSeq<4> vs3(16), vs4(20);
> 5640:     VSeq<2> vq(30);          // pair of constants for montmul

Suggestion:

    VSeq<2> vq(30);          // pair of constants for montmul: qinv, q

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5642:

> 5640:     VSeq<2> vq(30);          // pair of constants for montmul
> 5641:     VSeq<2> vz(28);          // pair of zetas
> 5642:     VSeq<4> vc(27, 0);       // constant sequence for montmul

Suggestion:

    VSeq<4> vc(27, 0);       // constant for montmul: montRSquareModQ

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6094:

> 6092:     VSeq<8> vc2_1(31, 0);
> 6093:     VSeq<2> vc2_2(31, 0);
> 6094:     FloatRegister vc2_3 = v31;

Suggestion:

    // we also need a pair of corresponding constant sequences
    VSeq<8> vc1_1(30, 0);       // kyber_q
    VSeq<2> vc1_2(30, 0);
    FloatRegister vc1_3 = v30;

    VSeq<8> vc2_1(31, 0);       // kyberBarrettMultiplier
    VSeq<2> vc2_2(31, 0);
    FloatRegister vc2_3 = v31;

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6102:

> 6100:     // load q and the multiplier for the Barrett reduction
> 6101:     __ add(kyberConsts, kyberConsts, 16);
> 6102:     __ ldpq(vc1_3, vc2_3, kyberConsts);

Suggestion:

    __ ldpq(vc1_3, vc2_3, kyberConsts); // kyber_q, kyberBarrettMultiplier

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6111:

> 6109:         __ ldr(vs1_3, __ Q, __ post(coeffs, 16));
> 6110:       }
> 6111:       vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1);

Suggestion:

      vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1); // vs2 <- (2 * vs1 * kyberBarrettMultiplier) >> 16

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6116:

> 6114:         __ sqdmulh(vs2_3, __ T8H, vs1_3, vc2_3);
> 6115:       }
> 6116:       vs_sshr(vs2_1, __ T8H, vs2_1, 11);

Suggestion:

      vs_sshr(vs2_1, __ T8H, vs2_1, 11); // vs2 <- (vs1 * kyberBarrettMultiplier) >> 26

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6121:

> 6119:         __ sshr(vs2_3, __ T8H, vs2_3, 11);
> 6120:       }
> 6121:       vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1);

Suggestion:

      vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1); // vs1 <- vs1 - vs2 * kyber_q

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041886663
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041888092
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889242
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889966
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892243
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892994
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041893924
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041895469
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041896626
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041898215
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041899623
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902206
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902833
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041906126
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041909711
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041929011
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041867539
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041868670
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041881429
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883012
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883686