RFR: 8341527: AVX-512 intrinsic for SHA3 [v4]

Volodymyr Paprotski duke at openjdk.org
Thu Oct 10 17:00:24 UTC 2024


On Sat, 5 Oct 2024 16:47:12 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> There is already an intrinsic for SHA-3 for aarch64, which gives a significant speed improvement on that architecture, so this pull request brings a similar improvement to the x64 family of systems that have the AVX-512 extension. Rudimentary measurements show that a 30-40% speed improvement can be achieved.
>
> Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - Merge branch 'master' into sha3-avx512-intrinsic
>  - fix windows build
>  - fix debug build
>  - 8341527: AVX-512 intrinsic for SHA3

Confirmed performance on my dev machine. Looks good!

Instruction selection: no complaints. `vperm*` instructions tend to be slower on AVX2, but work great here. A clean, compact, and easy-to-read implementation.

I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirming correctness': testing.

I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run the tests with `-Xcomp -XX:-TieredCompilation`, the test finishes before the code is even compiled. I've done that before, running the tests twice with each set of options; it's 'better than nothing' (unless I am missing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify against.

Except I am not sure how one would implement fuzzing for SHA3; perhaps you have some thoughts. It seems impossible to have both the intrinsic and the java/interpreter version running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that the single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.

For SHA3, all roads lead to the intrinsic (which is a good thing... except for testing). No DirectByteBuffer path, no single-block bypass... The only potential opening is that the single-block intrinsic appears unreachable: looking at `DigestBase.implCompressMultiBlock`, it will always call the multi-block intrinsic (unless I am missing some fancy predicate generation by the JIT).

If `DigestBase.implCompressMultiBlock` were 'fixed' to require at least two full blocks before calling the multi-block intrinsic, then one could implement fuzzing by alternately disabling one of the single-/multi-block intrinsics.
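For illustration, here is a minimal sketch of the buffering half of such a fuzz test, using only the public `MessageDigest` API (class name and parameters are mine, not from the PR). On its own it only checks self-consistency across chunk boundaries, so both sides still reach the same intrinsic; the interesting intrinsic-vs-java comparison would additionally need one run per JVM with one of the intrinsics disabled, as described above.

```java
import java.security.MessageDigest;
import java.util.Random;

public class Sha3FuzzSketch {
    public static void main(String[] args) throws Exception {
        Random rnd = new Random(42);  // fixed seed for reproducibility
        for (int iter = 0; iter < 100; iter++) {
            byte[] data = new byte[1 + rnd.nextInt(4096)];
            rnd.nextBytes(data);

            // One-shot digest: a single update() feeds the whole buffer
            // through one implCompressMultiBlock call.
            byte[] expected = MessageDigest.getInstance("SHA3-256").digest(data);

            // Chunked digest: random split points vary how many full blocks
            // each implCompressMultiBlock call sees (including zero).
            MessageDigest chunked = MessageDigest.getInstance("SHA3-256");
            int off = 0;
            while (off < data.length) {
                int len = Math.min(1 + rnd.nextInt(200), data.length - off);
                chunked.update(data, off, len);
                off += len;
            }
            byte[] actual = chunked.digest();

            if (!MessageDigest.isEqual(expected, actual)) {
                throw new AssertionError("digest mismatch at iteration " + iter);
            }
        }
        System.out.println("OK");
    }
}
```

Run twice, e.g. with and without `-XX:-UseSHA3Intrinsics`, and compare the results across runs to cover the intrinsic against the pure-java path.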

src/hotspot/cpu/x86/macroAssembler_x86.hpp line 1579:

> 1577:     }
> 1578:   }
> 1579:   void evpsrad(XMMRegister dst, KRegister mask, XMMRegister src, int shift, bool merge, int vector_len) {

A more compact way to 'unhide' a function from Assembler.hpp is the C++ `using` declaration: `using Assembler::evpsrad;`. (You can see it being used a bit further below, at line 1589.)

This comment applies to several changes in this file.

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 85:

> 83: void StubGenerator::generate_sha3_stubs() {
> 84:   if (UseSHA3Intrinsics) {
> 85:     if (VM_Version::supports_evex()) {

This really should be an assert; i.e. all CPU-flag checks should be done in vm_version_x86.cpp, and by this point, if `UseSHA3Intrinsics` is on, "we are good to go".

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 148:

> 146:   __ addl(rax, 8);
> 147:   __ kmovbl(k4, rax);
> 148:   __ addl(rax, 16);

Since you need k5 soonest, you could save a few cycles by breaking the dependency chain through rax and loading each immediate directly...

(If you really want to get clever:

```
  KRegister masks[] = {k1, k2, k3, k4, k5};
  for (int i = 0; i < 5; i++) {
    __ mov64(rax, (1L << (i + 1)) - 1);  // mask values 1, 3, 7, 15, 31
    __ kmovbl(masks[i], rax);
  }
```

Highly debatable whether it's actually any more readable, so up to you.)

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 152:

> 150: 
> 151:   // load the state
> 152:   __ evmovdquq(xmm0, k5, Address(state, 0), false, Assembler::AVX_512bit);

There is a potential issue with masked loads, but I confirmed (I believe) that it is not an issue here.

Background and what I tried. See [the Optimization Reference Manual, Chap. 17.4](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html):

When using masked store and load, consider the following:
- When the mask is not all-ones or all-zeroes, the load operation, following the masked store operation from the 
same address is blocked, until the data is written to the cache.

(I was hit by this in the Montgomery ECC intrinsic. The solution there was to break the masked load into full loads/stores; that is, a 5-element 512-bit load/store becomes a 64-bit mov and a 256-bit mov.)

I ran JMH with this PR enabled and collected perfasm. I didn't see any ticks on the loads (entry) / stores (exit). I believe this is because of `implCompressMultiBlock`. The only reason for the processor to use store-forwarding is if a store is followed immediately by a load; i.e. if `implCompress` were called in a loop (it is), the first intrinsic call would be fine, but all the subsequent ones would stall on entry, loading what the previous iteration had only just stored.

With `implCompressMultiBlock`, the next iteration comes 'much later' and doesn't need store-forwarding.

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 178:

> 176: 
> 177:   // there will be 24 keccak rounds
> 178:   __ movl(roundsLeft, 24);

Since `roundsLeft` is a constant, you could unroll this with a C++ loop in the stub generator instead. I believe the only loop variable is `constant2use`, used for the memory address access, so each unrolled round would use a constant offset.

Not sure if this will give you any performance benefit, but it's a 'relatively easy' experiment. (i.e. are 23 extra copies of the loop body worth the extra cache pressure?)

src/hotspot/cpu/x86/vm_version_x86.cpp line 1316:

> 1314:   }
> 1315: 
> 1316:   if (UseAVX > 2) {

Should be wrapped in `#ifdef _LP64` (similar to the format above). Need to look up the CPU features required by the instructions in the intrinsic...

-------------

PR Review: https://git.openjdk.org/jdk/pull/21352#pullrequestreview-2352787214
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1790783146
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1790777522
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1792630210
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1792517204
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1792534573
PR Review Comment: https://git.openjdk.org/jdk/pull/21352#discussion_r1790776557
