RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v8]

Wed Feb 25 20:08:08 UTC 2026

On Fri, 20 Feb 2026 17:13:58 GMT, Andrew Haley <aph at openjdk.org> wrote:

>>> That would be really useful! I tinkered with it a bit, but it would be nice to see what you had in mind.
>> 
>> Like this:
>> 
>>   address generate_intpoly_montgomeryMult_P256() {
>> 
>>     __ align(CodeEntryAlignment);
>>     StubId stub_id = StubId::stubgen_intpoly_montgomeryMult_P256_id;
>>     StubCodeMark mark(this, stub_id);
>>     address start = __ pc();
>>     __ enter();
>> 
>>     static const int64_t modulus[] = {
>>       0x000fffffffffffffL, 0x00000fffffffffffL,
>>       0x0000001000000000L, 0x0000ffffffff0000L,
>>       0L
>>     };
>> 
>>     int shift1 = 12; // 64 - bits per limb
>>     int shift2 = 52; // bits per limb
>> 
>>     // Registers that are used throughout entire routine
>>     const Register a = c_rarg0;
>>     const Register b = c_rarg1;
>>     const Register result = c_rarg2;
>> 
>>     RegSet regs = RegSet::range(r0, r28) + rfp + lr - a - b - result;
>>     FloatRegSet floatRegs = FloatRegSet::range(v0, v31)
>>       - FloatRegSet::range(v8, v15)   // Callee-saved vectors
>>       - FloatRegSet::range(v16, v31); // Manually-allocated vectors
>> 
>>     auto common_regs = regs.begin();
>>     Register limb_mask = *common_regs++,
>>       c_ptr = *common_regs++,
>>       mod_0 = *common_regs++,
>>       mod_1 = *common_regs++,
>>       mod_3 = *common_regs++,
>>       mod_4 = *common_regs++,
>>       b_0 = *common_regs++,
>>       b_1 = *common_regs++,
>>       b_2 = *common_regs++,
>>       b_3 = *common_regs++,
>>       b_4 = *common_regs++;
>>     regs = common_regs.remaining();
>> 
>>     auto common_vectors = floatRegs.begin();
>>     FloatRegister limb_mask_vec = *common_vectors++,
>>       b_lows = *common_vectors++,
>>       b_highs = *common_vectors++,
>>       a_vals = *common_vectors++;
>> 
>>     // Push callee saved registers on to the stack
>>     RegSet callee_saved = RegSet::range(r19, r28);
>>     __ push(callee_saved, sp);
>> 
>>     // Allocate space on the stack for carry values
>>     __ sub(sp, sp, 48);
>>     __ mov(c_ptr, sp);
>> 
>>     // Calculate limb mask
>>     __ mov(limb_mask, -UCONST64(1) >> (64 - shift2));
>>     __ dup(limb_mask_vec, __ T2D, limb_mask);
>> 
>>     // Load input arrays and modulus
>>     {
>>       auto r = regs.begin();
>>       Register a_ptr = *r++, mod_ptr = *r++;
>>       __ add(a_ptr, a, 24);
>>       __ lea(mod_ptr, ExternalAddress((address)modulus));
>>       __ ldr(b_0, Address(b));
>>       __ ldr(b_1, Address(b, 8));
>>       __ ldr(b_2, Address(b, 16));
>>       __ ldr(b_3, Address(b, 24));
>>       __ ldr(b_4, Address(b, 32));
>>      ...
>
> Note that in a few places I've had to push back dead registers so that they can be reused. This is necessary because the live ranges for some registers partially overlap.
> 
> It's much better if you don't do that: instead, write a structured assembly-language program in which registers are allocated in scopes as needed, as I've done in the section which begins like this:
> 
> 
>     // Load input arrays and modulus
>     {
>       auto r = regs.begin();
>       Register a_ptr = *r++, mod_ptr = *r++;
> 
> 
> Here, the registers that contain `a_ptr` and `mod_ptr` are taken from the outer block, and are free for reuse when the inner block exits.
> 
> I hope the advantages of this style are clear: the program is easier to write and to maintain, and much less risky. Also, and most importantly for me, it's much easier to review!

Thanks for taking the time to write all this out! I will do a refactor and integrate these changes shortly.
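
For reference, here is a minimal standalone sketch of the scoped allocation idea as I understand it. It does not use the HotSpot classes: `RegPool`, `RegCursor`, and `begin_alloc` are hypothetical stand-ins for `RegSet` and its iterator, and registers are plain integers. The point it tries to show is that a cursor opened inside a block works on a copy of the pool, so the temporaries it hands out are free again once the block exits, with no explicit push back.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a register set: bit n set means register n is free.
struct RegPool {
  uint32_t bits;

  // Number of free registers in the pool.
  int count() const {
    int n = 0;
    for (uint32_t b = bits; b != 0; b &= b - 1) n++;
    return n;
  }
};

// Hypothetical stand-in for the set iterator: hands out registers from a copy.
struct RegCursor {
  uint32_t bits;

  // Take the lowest-numbered free register and remove it from this cursor.
  int take() {
    assert(bits != 0);
    int n = 0;
    while (((bits >> n) & 1u) == 0) n++;
    bits &= bits - 1;  // clear the lowest set bit
    return n;
  }

  // Registers not yet handed out by this cursor.
  RegPool remaining() const { return RegPool{bits}; }
};

// Open a cursor over a copy of the pool; the pool itself is not modified.
inline RegCursor begin_alloc(const RegPool& pool) { return RegCursor{pool.bits}; }

// Mirrors the structure of the stub: long-lived registers are committed with
// remaining(), temporaries are taken inside blocks and released on exit.
bool scoped_temps_are_reused() {
  RegPool regs{0x0000ff00u};  // assumption for illustration: r8..r15 are free

  auto common = begin_alloc(regs);
  int limb_mask = common.take();  // long-lived
  int c_ptr = common.take();      // long-lived
  regs = common.remaining();      // commit: these two are no longer free

  int first_a, first_b, second_a;
  {
    auto r = begin_alloc(regs);
    first_a = r.take();
    first_b = r.take();
  }  // the temporaries become free again here; regs was never changed
  {
    auto r = begin_alloc(regs);
    second_a = r.take();  // same register as first_a
  }

  return limb_mask == 8 && c_ptr == 9 &&
         first_a == 10 && first_b == 11 &&
         second_a == first_a && regs.count() == 6;
}
```

Only the registers committed through `remaining()` leave the pool for good, which is why the second block gets the same temporary as the first.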

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2855165827