RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11]

Tue Apr 1 18:47:39 UTC 2025

On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Further readability improvements.
>  - Added asserts for array sizes

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342:

> 340: // Performs two keccak() computations in parallel. The steps of the
> 341: // two computations are executed interleaved.
> 342: static address generate_double_keccak(StubGenerator *stubgen, MacroAssembler *_masm) {

This function seems OK. I didn't do as exact a line-by-line review as I did for the NTT intrinsics; I just put the new version in a diff next to the original function. It looks like a reasonably clean refactor: the block size is hardcoded and new input registers 10-14 are added, which makes them really easy to tell apart from the original registers 0-4.

I didn't realize before that the 'top 3 limbs' are wasted. I guess it doesn't matter: there are plenty of registers to spare, and it makes the entire algorithm cleaner and easier to follow.
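
To make the 'top 3 limbs' remark concrete, here is a minimal Java sketch of the lane arithmetic. It assumes the layout implied above: each 512-bit register holds one 5-lane row of the 5x5 Keccak state, so one state occupies 5 registers. The class and method names are hypothetical and not part of the PR, and the sketch does not model how many temporary registers the stub actually needs.

```java
// Hypothetical helper, for illustration only; not code from the PR.
public class KeccakLaneBudget {
    static final int ROWS = 5;           // Keccak state is 5x5 lanes
    static final int LANES_PER_ROW = 5;  // assumption: one row per register
    static final int LANE_BITS = 64;     // Keccak-f[1600] lane width
    static final int ZMM_BITS = 512;     // AVX-512 register width
    static final int ZMM_COUNT = 32;     // zmm0..zmm31 in 64-bit mode

    // Number of 64-bit lanes that fit in one 512-bit register.
    static int lanesPerRegister() {
        return ZMM_BITS / LANE_BITS;
    }

    // Lanes left unused when a register holds a single 5-lane row.
    static int unusedLanesPerRegister() {
        return lanesPerRegister() - LANES_PER_ROW;
    }

    // Registers occupied by state alone for n interleaved computations.
    static int stateRegisters(int parallelStates) {
        return parallelStates * ROWS;
    }

    // Registers left over for temporaries and constants.
    static int registersLeft(int parallelStates) {
        return ZMM_COUNT - stateRegisters(parallelStates);
    }

    public static void main(String[] args) {
        System.out.println("unused lanes per register: " + unusedLanesPerRegister());
        for (int n : new int[] {1, 2, 4}) {
            System.out.println(n + " state(s): " + stateRegisters(n)
                    + " state registers, " + registersLeft(n) + " left");
        }
    }
}
```

Under these assumptions, two interleaved states take 10 of the 32 registers for state and four would take 20; whether the remainder is enough depends on the temporaries the stub uses, which this sketch leaves out.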

I also stared at the algorithm with the 'What about AVX2?' question in mind. It looks like this function would pretty much need to be rewritten :/

Two last questions:
- How much performance is gained from doubling this function up?
- If that's worth it, what if the input were quadrupled instead? (I scanned the Java code, and it looked like NR was already parametrized to 2.) It looks like there are almost enough registers here to go to 4. I think 3 would need to be freed up somehow; alternatively, the upper 3 limbs are empty in all operations, so perhaps they could be used instead, at the expense of readability.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2017636762