RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7]
Andrew Dinn
adinn at openjdk.org
Mon Apr 14 13:00:45 UTC 2025
On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:
>> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
>
> - Code rearrange, some renaming, fixing comments
> - Changes suggested by Andrew Dinn.
src/hotspot/cpu/aarch64/register_aarch64.hpp line 509:
> 507: }
> 508:
> 509: // convenience methods for splitting 8-way of 4-way vector register
Suggestion:
// convenience methods for splitting 8-way or 4-way vector register
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5012:
> 5010: assert(!va.is_constant(), "output vector must identify 2 different registers");
> 5011:
> 5012: // schedule 2 streams of i<nstructions across the vector sequences
Suggestion:
// schedule 2 streams of instructions across the vector sequences
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5166:
> 5164: //
> 5165: // On each level, we fill up the vector registers in such a way that the
> 5166: // array elements that need to be multiplied by the zetas be in one
Suggestion:
// array elements that need to be multiplied by the zetas are in one
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5168:
> 5166: // array elements that need to be multiplied by the zetas be in one
> 5167: // set of vector registers while the corresponding ones that don't need to
> 5168: // be multiplied, in another set. We can do 32 Montgomery multiplications
Suggestion:
// be multiplied are in another set. We can do 32 Montgomery multiplications
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5278:
> 5276: // level 4
> 5277: vs_ldpq(vq, kyberConsts);
> 5278: int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 };
I'd like to add comment here to explain the coefficient grouping
and likewise at level 5 and 6. So here we have:
Suggestion:
// Up to level 3 the coefficients multiplied by or added/subtracted
// to the zetas occur in discrete blocks whose size is some multiple
// of 32. At level 4 coefficients occur in 8 discrete blocks of size 16
// so they are loaded using an ldr at 8 distinct offsets.
int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 };
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5299:
> 5297:
> 5298: // level 5
> 5299: vs_ldpq(vq, kyberConsts);
Suggestion:
vs_ldpq(vq, kyberConsts);
// At level 5 related coefficients occur in discrete blocks of size 8 so
// need to be loaded interleaved using an ld2 operation with arrangement 2D
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319:
> 5317: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4);
> 5318:
> 5319: // level 6
Suggestion:
// level 6
// At level 6 related coefficients occur in discrete blocks of size 4 so
// need to be loaded interleaved using an ld2 operation with arrangement 4S
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377:
> 5375: // level 0
> 5376: vs_ldpq(vq, kyberConsts);
> 5377: int offsets4[4] = { 0, 32, 64, 96 };
Again a comment
Suggestion:
// At level 6 related coefficients occur in discrete blocks of size 4 so
// need to be loaded interleaved using an ld2 operation with arrangement 4S
int offsets4[4] = { 0, 32, 64, 96 };
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399:
> 5397: vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4);
> 5398:
> 5399: // level 1
Again a comment
Suggestion:
// level 1
// At level 1 related coefficients occur in discrete blocks of size 8 so
// need to be loaded interleaved using an ld2 operation with arrangement 2D
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5422:
> 5420: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4);
> 5421:
> 5422: // level 2
Again
Suggestion:
// level 2
// At level 2 coefficients occur in 8 discrete blocks of size 16
// so they are loaded using employing an ldr at 8 distinct offsets.
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464:
> 5462: vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3);
> 5463:
> 5464: // level 3
Suggestion:
// level 3
// From level 3 upwards coefficients occur in discrete blocks whose size is
// some multiple of 32 so can be loaded using ldpq and suitable indexes.
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5591:
> 5589: store64shorts(vs2, tmpAddr);
> 5590:
> 5591: load64shorts(vs1, tmpAddr);
I'd like to make explicit the fact that we have avoided doing an add here (and in the next two cases) by adding a commented out generation step i.e. at this line insert
Suggestion:
// __ add(tmpAddr, coeffs, 128); // unneeded as implied by preceding store
load64shorts(vs1, tmpAddr);
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5596:
> 5594: store64shorts(vs2, tmpAddr);
> 5595:
> 5596: load64shorts(vs1, tmpAddr);
Likewise insert:
Suggestion:
// __ add(tmpAddr, coeffs, 256); // unneeded as implied by preceding store
load64shorts(vs1, tmpAddr);
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5601:
> 5599: store64shorts(vs2, tmpAddr);
> 5600:
> 5601: load64shorts(vs1, tmpAddr);
Likewise insert:
Suggestion:
// __ add(tmpAddr, coeffs, 384); // unneeded as implied by preceding store
load64shorts(vs1, tmpAddr);
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5640:
> 5638: VSeq<4> vs1(0), vs2(4); // 4 sets of 8x8H inputs/outputs/tmps
> 5639: VSeq<4> vs3(16), vs4(20);
> 5640: VSeq<2> vq(30); // pair of constants for montmul
Suggestion:
VSeq<2> vq(30); // pair of constants for montmul: qinv, q
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5642:
> 5640: VSeq<2> vq(30); // pair of constants for montmul
> 5641: VSeq<2> vz(28); // pair of zetas
> 5642: VSeq<4> vc(27, 0); // constant sequence for montmul
Suggestion:
VSeq<4> vc(27, 0); // constant for montmul: montRSquareModQ
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6094:
> 6092: VSeq<8> vc2_1(31, 0);
> 6093: VSeq<2> vc2_2(31, 0);
> 6094: FloatRegister vc2_3 = v31;
Suggestion:
// we also need a pair of corresponding constant sequences
VSeq<8> vc1_1(30, 0); // kyber_q
VSeq<2> vc1_2(30, 0);
FloatRegister vc1_3 = v30;
VSeq<8> vc2_1(31, 0); // kyberBarrettMultiplier
VSeq<2> vc2_2(31, 0);
FloatRegister vc2_3 = v31;
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6102:
> 6100: // load q and the multiplier for the Barrett reduction
> 6101: __ add(kyberConsts, kyberConsts, 16);
> 6102: __ ldpq(vc1_3, vc2_3, kyberConsts);
Suggestion:
__ ldpq(vc1_3, vc2_3, kyberConsts); // kyber_q, kyberBarrettMultiplier
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6111:
> 6109: __ ldr(vs1_3, __ Q, __ post(coeffs, 16));
> 6110: }
> 6111: vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1);
Suggestion:
vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1); // vs2 <- (2 * vs1 * kyberBarrettMultiplier) >> 16
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6116:
> 6114: __ sqdmulh(vs2_3, __ T8H, vs1_3, vc2_3);
> 6115: }
> 6116: vs_sshr(vs2_1, __ T8H, vs2_1, 11);
Suggestion:
vs_sshr(vs2_1, __ T8H, vs2_1, 11); // vs2 <- (vs1 * kyberBarrettMultiplier) >> 26
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6121:
> 6119: __ sshr(vs2_3, __ T8H, vs2_3, 11);
> 6120: }
> 6121: vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1);
Suggestion:
vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1); // vs1 <- vs1 - vs2 * kyber_q
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041886663
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041888092
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889242
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889966
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892243
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892994
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041893924
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041895469
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041896626
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041898215
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041899623
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902206
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902833
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041906126
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041909711
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041929011
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041867539
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041868670
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041881429
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883012
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883686
More information about the hotspot-dev
mailing list