RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7]

Tue Apr 15 14:18:54 UTC 2025

On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Code rearrange, some renaming, fixing comments
>  - Changes suggested by Andrew Dinn.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5665:

> 5663:     vs_ld2_post(vs_back(vs1), __ T8H, nttb);
> 5664:     vs_ld2_post(vs_front(vs4), __ T8H, ntta);
> 5665:     vs_ld2_post(vs_back(vs4), __ T8H, nttb);

Suggestion:

    vs_ld2_post(vs_front(vs1), __ T8H, ntta); // <a0, a1> x 8H
    vs_ld2_post(vs_back(vs1), __ T8H, nttb);  // <b0, b1> x 8H
    vs_ld2_post(vs_front(vs4), __ T8H, ntta); // <a2, a3> x 8H
    vs_ld2_post(vs_back(vs4), __ T8H, nttb);  // <b2, b3> x 8H
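
For readers less familiar with ld2, the <a0, a1> notation above can be pictured with a scalar model. This is only a sketch: it assumes each vs_ld2_post call on a two-register sequence issues a single LD2 with a T8H arrangement, and the name ld2_8h_model is hypothetical, not part of the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Eight 16-bit lanes, standing in for one vector register in T8H arrangement.
using Lanes8H = std::array<int16_t, 8>;

// Hypothetical scalar model of one LD2 {Vt.8H, Vt2.8H} load: 16 consecutive
// halfwords are de-interleaved into an even-indexed and an odd-indexed register.
std::pair<Lanes8H, Lanes8H> ld2_8h_model(const int16_t* src) {
  Lanes8H even{};
  Lanes8H odd{};
  for (std::size_t i = 0; i < 8; i++) {
    even[i] = src[2 * i];      // a0-style lanes: coefficients 0, 2, 4, ...
    odd[i]  = src[2 * i + 1];  // a1-style lanes: coefficients 1, 3, 5, ...
  }
  return {even, odd};
}
```

Under that assumption, lane i of the first register and lane i of the second register together hold one adjacent coefficient pair of the input.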

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5668:

> 5666:     // montmul the first and second pair of values loaded into vs1
> 5667:     // in order and then with one pair reversed storing the  two
> 5668:     // results in vs3

Suggestion:

    // compute 4 montmul cross-products for pairs (a0,a1) and (b0,b1),
    // i.e. montmul the first and second halves of vs1 in order and
    // then with one sequence reversed, storing the two results in vs3
    //
    // vs3[0] <- montmul(a0, b0)
    // vs3[1] <- montmul(a1, b1)
    // vs3[2] <- montmul(a0, b1)
    // vs3[3] <- montmul(a1, b0)
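
A single-lane scalar sketch of the four cross-products listed above may help when checking the comment against the code. It assumes the usual signed 16-bit Montgomery multiplication for q = 3329 with R = 2^16; the constants, the Cross struct and cross_products are illustrative names, not identifiers from the stub generator.

```cpp
#include <cstdint>

// Assumed constants: the ML-KEM modulus q and q^-1 mod 2^16.
constexpr int32_t kQ = 3329;
constexpr int32_t kQInv = -3327;

// Signed Montgomery multiplication: the result is congruent to
// a * b * 2^-16 (mod q) and fits in 16 bits.
int16_t montmul(int16_t a, int16_t b) {
  int32_t prod = static_cast<int32_t>(a) * b;
  // t is congruent to prod * q^-1 modulo 2^16
  int16_t t = static_cast<int16_t>(static_cast<int16_t>(prod) * kQInv);
  // prod - t * q has its low 16 bits clear, so the division is exact
  return static_cast<int16_t>((prod - static_cast<int32_t>(t) * kQ) / 65536);
}

// One lane of the four results, in the order given for vs3[0..3].
struct Cross {
  int16_t a0b0;
  int16_t a1b1;
  int16_t a0b1;
  int16_t a1b0;
};

Cross cross_products(int16_t a0, int16_t a1, int16_t b0, int16_t b1) {
  return {montmul(a0, b0), montmul(a1, b1), montmul(a0, b1), montmul(a1, b0)};
}
```

The first two entries pair the inputs in order and the last two pair them with one side reversed, which is the "in order, then reversed" pattern the comment describes.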

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5674:

> 5672:     // montmul the first and second pair of values loaded into vs4
> 5673:     // in order and then with one pair reversed storing the two
> 5674:     // results in vs1

Suggestion:

    // compute 4 montmul cross-products for pairs (a2,a3) and (b2,b3),
    // i.e. montmul the first and second halves of vs4 in order and
    // then with one sequence reversed, storing the two results in vs1
    //
    // vs1[0] <- montmul(a2, b2)
    // vs1[1] <- montmul(a3, b3)
    // vs1[2] <- montmul(a2, b3)
    // vs1[3] <- montmul(a3, b2)

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5680:

> 5678:     // for each pair of results pick the second value in the first
> 5679:     // pair to create a sequence that we montmul by the zetas
> 5680:     // i.e. we want sequence <vs3[1], vs1[1]>

Suggestion:

    // montmul the second result of each cross-product, i.e.
    // (a1*b1, a3*b3), by a zeta. We can schedule two montmuls at a
    // time if we use a suitable vector sequence <vs3[1], vs1[1]>.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5683:

> 5681:     int delta = vs1[1]->encoding() - vs3[1]->encoding();
> 5682:     VSeq<2> vs5(vs3[1], delta);
> 5683:     kyber_montmul16(vs5, vz, vs5, vs_front(vs2), vq);

Suggestion:

    // vs3[1] <- montmul(montmul(a1, b1), z0)
    // vs1[1] <- montmul(montmul(a3, b3), z1)
    kyber_montmul16(vs5, vz, vs5, vs_front(vs2), vq);
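
As background for why only the a1*b1 term is multiplied by a zeta, here is a sketch of the base-case product (a0 + a1 X)(b0 + b1 X) mod (X^2 - zeta) in plain modular arithmetic, with no Montgomery scaling. The final additions are my reading of the standard ML-KEM base-case multiplication and are not taken from the lines quoted here; basemul_plain, mod_q and kMlKemQ are hypothetical names.

```cpp
#include <cstdint>

// Assumed ML-KEM modulus.
constexpr int64_t kMlKemQ = 3329;

// Reduce x into the range [0, q).
int64_t mod_q(int64_t x) {
  int64_t r = x % kMlKemQ;
  return r < 0 ? r + kMlKemQ : r;
}

struct Pair {
  int64_t c0;  // constant coefficient
  int64_t c1;  // coefficient of X
};

// (a0 + a1 X)(b0 + b1 X) reduced modulo (X^2 - zeta) and modulo q.
// X^2 is replaced by zeta, so only the a1*b1 term picks up the zeta factor.
Pair basemul_plain(int64_t a0, int64_t a1, int64_t b0, int64_t b1, int64_t zeta) {
  int64_t a1b1z = mod_q(mod_q(a1 * b1) * zeta);
  return {mod_q(a0 * b0 + a1b1z), mod_q(a0 * b1 + a1 * b0)};
}
```

For example, basemul_plain(1, 2, 3, 4, 17) gives c0 = 3 + 8*17 = 139 and c1 = 4 + 6 = 10. The stub works on Montgomery-scaled values instead, so its intermediate results carry extra 2^-16 factors that this plain version ignores.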

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044679089
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044682671
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044684696
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044689607
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044691632