RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11]

Ferenc Rakoczi duke at openjdk.org
Wed Apr 2 08:22:22 UTC 2025


On Thu, 27 Mar 2025 21:42:08 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - Further readability improvements.
>>  - Added asserts for array sizes
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342:
> 
>> 340: // Performs two keccak() computations in parallel. The steps of the
>> 341: // two computations are executed interleaved.
>> 342: static address generate_double_keccak(StubGenerator *stubgen, MacroAssembler *_masm) {
> 
> This function seems ok. I didn't do as line-by-line an 'exact' review as for the NTT intrinsics; I just put the new version into a diff next to the original function. It seems like a reasonably clean 'refactor' (hardcode the block size, add new input registers 10-14; that makes it really easy to spot against the original registers 0-4).
> 
> I didn't realize before that the 'top 3 limbs' are wasted. I guess it doesn't matter; there are registers to spare aplenty, and it makes the entire algorithm cleaner and easier to follow.
> 
> I also stared at the algorithm with the 'what about AVX2?' question. It looks like this function would pretty much need to be rewritten :/
> 
> Last two questions:
> - How much performance is gained from doubling this function up?
> - If that's worth it, what if instead the input were quadrupled? (I scanned the Java code; it looked like NR was already parametrized to 2.) It looks like there are almost enough registers here to go to 4 (I think 3 would need to be freed up somehow; alternatively, the upper 3 limbs are empty in all operations, so perhaps they could be used instead, at the expense of readability).

Well, the algorithm (keccak()) does the same thing on groups of 5 array elements: it works on what is essentially a 5x5 matrix, doing row and column operations, so putting 5 array entries in a vector register was the "natural" thing to do.
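To illustrate the 5x5 structure, here is a plain-Java sketch of Keccak's theta step (as specified in FIPS 202; this is not the JDK's implementation). Each row of 5 lanes maps naturally onto one vector register, which is why the intrinsic keeps 5 lanes per 512-bit register and leaves the top 3 limbs unused:

```java
public class ThetaSketch {
    // a holds the 25 64-bit lanes of one Keccak state, indexed as a[5*y + x].
    static void theta(long[] a) {
        long[] c = new long[5];
        // Column parity: XOR the 5 lanes of each column together.
        for (int x = 0; x < 5; x++) {
            c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];
        }
        // Mix the parities of the two neighboring columns back into every lane.
        for (int x = 0; x < 5; x++) {
            long d = c[(x + 4) % 5] ^ Long.rotateLeft(c[(x + 1) % 5], 1);
            for (int y = 0; y < 5; y++) {
                a[5 * y + x] ^= d;
            }
        }
    }
}
```

With 5 lanes per vector register, the two loops over x become single vector XOR/rotate operations, and interleaving two independent states is just a second copy of the same instruction sequence on a second register set.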

This function can only be used under very special circumstances, which occur during the generation of the "A matrix" in ML-KEM and ML-DSA; the speed of that matrix generation has almost doubled (I don't have exact numbers).

We are using 7 registers per state and 15 for the constants, so we have only 3 to spare. We could perhaps juggle the constants, keeping only the ones needed next in registers and reloading the rest "just in time", but that might slow things down a bit: more load instructions executed, plus possibly some load latency. On the other hand, it would allow more parallelism. I might try it out.
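A back-of-the-envelope budget, using only the figures quoted above (7 zmm registers per keccak state, 15 for the round constants, 32 zmm registers available with AVX-512), shows why quadrupling would require either freeing registers or reloading constants:

```java
public class RegisterBudget {
    public static void main(String[] args) {
        int totalZmm = 32;   // AVX-512 register file
        int perState = 7;    // zmm registers per keccak state (figure from the reply)
        int constants = 15;  // zmm registers holding constants (figure from the reply)

        // Registers left over after allocating N interleaved states plus constants.
        int spareDoubled = totalZmm - (2 * perState + constants);
        int spareQuadrupled = totalZmm - (4 * perState + constants);

        System.out.println("spare with 2 states: " + spareDoubled);    // 3
        System.out.println("spare with 4 states: " + spareQuadrupled); // -11
    }
}
```

So with these numbers, four interleaved states would overshoot the register file by 11 registers unless most of the constants were reloaded "just in time" or the unused top limbs were pressed into service.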

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2024317665


More information about the security-dev mailing list