RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16]
Volodymyr Paprotski
duke at openjdk.org
Tue Nov 15 17:44:12 UTC 2022
On Tue, 15 Nov 2022 00:06:40 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:
>> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits:
>>
>> - Merge remote-tracking branch 'origin/master' into avx512-poly
>> - Vladimir's review
>> - live review with Sandhya
>> - jcheck
>> - Sandhya's review
>> - fix windows and 32b linux builds
>> - add getLimbs to interface and reviews
>> - fix 32-bit build
>> - make UsePolyIntrinsics option diagnostic
>> - Merge remote-tracking branch 'origin/master' into avx512-poly
>> - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 384:
>
>> 382: void StubGenerator::poly1305_limbs(const Register limbs, const Register a0, const Register a1, const Register a2, bool only128)
>> 383: {
>> 384: const Register t1 = r13;
>
> Please, make the temps explicit and lift them into arguments. Otherwise, it's hard to see what registers are clobbered when helper methods are called.
Thanks for pointing this out.. I spent quite a bit of time and went back and forth on 'register allocation'... it does make sense to pass all the temps needed, when the number of temps is small. This is the case for the three `*_limbs_*` functions. Maybe I should indeed do that...
On other hand, there are functions like `poly1305_multiply8_avx512` and `poly1305_process_blocks_avx512` that use a _lot_ of temp registers. I think it makes sense to keep those as 'function-header declarations'.
Then there are functions like `poly1305_multiply_scalar` that could go either way, has some temps and 'implicitly clobbered' registers, but probably should stay 'as is'..
I ended up being 'pedantic' and making _all_ temps into 'header variables'. I also tried to comment, but those probably mean more to me then anyone else in hindsight?
// Register Map:
// GPRs:
// input = rdi
// length = rbx
// accumulator = rcx
// R = r8
// a0 = rsi
// a1 = r9
// a2 = r10
// r0 = r11
// r1 = r12
// c1 = r8;
// t1 = r13
// t2 = r14
// t3 = r15
// t0 = r14
// rscratch = r13
// stack(rsp, rbp)
// imul(rax, rdx)
// ZMMs:
// T: xmm0-6
// C: xmm7-9
// A: xmm13-18
// B: xmm19-24
// R: xmm25-29
...
// Register Map:
// reserved: rsp, rbp, rcx
// PARAMs: rdi, rbx, rsi, r8-r12
// poly1305_multiply_scalar clobbers: r13-r15, rax, rdx
const Register t0 = r14;
const Register t1 = r13;
const Register rscratch = r13;
// poly1305_limbs_avx512 clobbers: xmm0, xmm1
// poly1305_multiply8_avx512 clobbers: xmm0-xmm6
const XMMRegister T0 = xmm2;
...
I think I am ok changing the `*limbs*` functions (even started, before I remembered my train of thought from months back..) but let me know if you agree with the rest of the reasoning?
-------------
PR: https://git.openjdk.org/jdk/pull/10582
More information about the hotspot-compiler-dev
mailing list