RFR: 8355216: Accelerate P-256 arithmetic on aarch64
Ben Perez
bperez at openjdk.org
Mon Dec 1 21:30:28 UTC 2025
On Fri, 24 Oct 2025 13:36:52 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7144:
>
>> 7142:
>> 7143: address generate_intpoly_montgomeryMult_P256() {
>> 7144:
>
> As a general point, it would help everyone if you provided pseudocode for the whole thing.
Happy to add pseudocode but it will be quite long and almost identical to what is already in `MontgomeryIntegerPolynomial256.mult()` except mine uses a loop instead of unrolling the whole thing
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7160:
>
>> 7158: };
>> 7159:
>> 7160: Register c_ptr = r9;
>
> `rscratch1` and `rscratch2` are used freely by macros, so aliasing them is always rather sketchy. As far as I can tell the arg registers aren't used here, so it makes sense to use `r3`...
Is there a reason hotspot doesn't leave `r9` open for use as a caller saved local variable like in the ARM docs https://developer.arm.com/documentation/102374/0103/Procedure-Call-Standard. Either way will fix.
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7169:
>
>> 7167: Register mul_tmp = r14;
>> 7168: Register n = r15;
>> 7169:
>
> Here, you could do something like
>
>
> RegSet scratch = RegSet::range(r3, r28) - rscratch1 - rscratch2;
>
> {
> auto r_it = scratch.begin();
> Register
> c_ptr = *r_it++,
> a_i = *r_it++,
> c_idx = *r_it++, //c_idx is not used at the same time as a_i
> limb_mask_scalar = *r_it++,
> b_j = *r_it++,
> mod_j = *r_it++,
> mod_ptr = *r_it++,
> mul_tmp = *r_it++,
> n = *r_it++;
> ...
> }
>
>
>
> Note that a RegSet iterator doesn't affect the RegSet it was created from, so once this block has ended you can allocate again from the set of scratch registers.
Is there by any chance documentation for `RegSet` that I can reference while making these changes?
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7229:
>
>> 7227: __ shl(high_01, __ T2D, high_01, shift1);
>> 7228: __ ushr(tmp, __ T2D, low_01, shift2);
>> 7229: __ orr(high_01, __ T2D, high_01, tmp);
>
> Suggestion:
>
> __ orr(high_01, __ T16B, high_01, tmp);
>
> everywhere.
thanks for the suggestion! Does using `T16B` improve performance? Similarly, should this be applied to `EOR` as well?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2475133382
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2539340501
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2475267081
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2474686235
More information about the hotspot-dev
mailing list