RFR: 8341527: AVX-512 intrinsic for SHA3 [v4]

Tue Oct 15 15:54:14 UTC 2024

On Thu, 10 Oct 2024 16:57:36 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

> Confirmed performance on my dev machine. Looks good!
> 

Thanks for looking at it!

> Instruction selection: no complaints. `vperm*` instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation
> 
> I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.
> 
> I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run tests with `-Xcomp -XX:-TieredCompilation`, the test will finish before the code is even compiled. I've done that before, running test twice with either options; its 'better then nothing' (unless I am not seeing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.

 I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too.

 The algorithms for the hash (digest) functions are designed so that any programming error would lead to erroneous output on any input, so if your implementation produces the correct result on a few randomly chosen inputs of sizes varying from 0 bytes to several blocks then you can claim with high confidence that it is correct. 
> 
> Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.
> 
> For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at `DigestBase.implCompressMultiBlock`, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).

 In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0  I also call the pure Java version (with a clone of the original inputs) and compare the results. 

> 
> If `DigestBase.implCompressMultiBlock` were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.

> src/hotspot/cpu/x86/macroAssembler_x86.hpp line 1579:
> 
>> 1577:     }
>> 1578:   }
>> 1579:   void evpsrad(XMMRegister dst, KRegister mask, XMMRegister src, int shift, bool merge, int vector_len) {
> 
> more compact way to 'unhide' function from Assembler.hpp is the `using` C++ feature : `using Assembler::evpsrad;`. (You can see it being used bit further below, line 1589)
> 
> Comment repeats in (several) changes in this file.

Changed as suggested.

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 85:
> 
>> 83: void StubGenerator::generate_sha3_stubs() {
>> 84:   if (UseSHA3Intrinsics) {
>> 85:     if (VM_Version::supports_evex()) {
> 
> This really should be an assert. i.e. All cpu-flag checks should be done in vm_version_x86.cpp and by this point if `UseSHA3Intrinsics` is on, "we are good to go"

Changed as suggested.

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 152:
> 
>> 150: 
>> 151:   // load the state
>> 152:   __ evmovdquq(xmm0, k5, Address(state, 0), false, Assembler::AVX_512bit);
> 
> There is a potential issue with masked loads, but I confirmed (I believe) that it is not an issue here.
> 
> Background and what I tried. See [Optimization Reference Manual Vol. Chap 17.4](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)
> 
> When using masked store and load, consider the following:
> - When the mask is not all-ones or all-zeroes, the load operation, following the masked store operation from the 
> same address is blocked, until the data is written to the cache.
> 
> (I was hit by this in the Montgomery ECC intrinsic. The solution there was to break masked load into full loads/stores. That is a 5-element 512 load/store becomes a 64bit mov and 256-bit mov)
> 
> I ran JMH with this PR enabled and  collected perfasm. Didn't see any ticks on the loads(entry)/stores(exit). I believe this is because  of the `implCompressMultiBlock`. The only reason for the processor to use store-forwarding is if a store is followed immediately by a load. i.e. if `implCompress` was being called in a loop (it is), the first intrinsic would be fine, but all the subsequent ones would get stalled on entry, loading what the previous iteration only just saved.
> 
> With `implCompressMultiBlock`, the next iteration comes 'much later', and doesnt need store-forwarding.

Based on your comment I kept it as is.

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 178:
> 
>> 176: 
>> 177:   // there will be 24 keccak rounds
>> 178:   __ movl(roundsLeft, 24);
> 
> Since `roundsLeft` is a constant, you can use a C++ loop here instead. I believe the only loop variable is `constant2use`, used for memory address access.. slightly more constant access?
> 
> Not sure if this will give you any performance benefit, but its a 'relatively easy' experiment. (ie. 23 extra copies of the loop, worth the extra cache pressure?)

I rather keep the code compact.

> src/hotspot/cpu/x86/vm_version_x86.cpp line 1316:
> 
>> 1314:   }
>> 1315: 
>> 1316:   if (UseAVX > 2) {
> 
> Should be `#ifdef _LP64`. (Similar format from above). Need to look up the cpu features required for the instructions in the intrinsic..

Added the #ifdef.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21352#issuecomment-2414406442
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1801483332
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1801483262
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1801483419
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1801483537
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1801483204