RFR: 8341527: AVX-512 intrinsic for SHA3 [v4]

Tue Oct 15 23:04:09 UTC 2024

On Tue, 15 Oct 2024 15:51:59 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> Confirmed performance on my dev machine. Looks good!
>> 
>> Instruction selection: no complaints. `vperm*` instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation
>> 
>> I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.
>> 
>> I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run tests with `-Xcomp -XX:-TieredCompilation`, the test will finish before the code is even compiled. I've done that before, running the test twice with either option; it's 'better than nothing' (unless I am not seeing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.
>> 
>> Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.
>> 
>> For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at `DigestBase.implCompressMultiBlock`, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).
>> 
>> If `DigestBase.implCompressMultiBlock` were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.
>
>> Confirmed performance on my dev machine. Looks good!
>> 
> 
> Thanks for looking at it!
> 
>> Instruction selection: no complaints. `vperm*` instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation
>> 
>> I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.
>> 
>> I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run tests with `-Xcomp -XX:-TieredCompilation`, the test will finish before the code is even compiled. I've done that before, running the test twice with either option; it's 'better than nothing' (unless I am not seeing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.
> 
>  I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too.
> 
>  The algorithms for the hash (digest) functions are designed so that any programming error would lead to erroneous output on any input, so if your implementation produces the correct result on a few randomly chosen inputs of sizes varying from 0 bytes to several blocks then you can claim with high confidence that it is correct. 
>> 
>> Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.
>> 
>> For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at `DigestBase.implCompressMultiBlock`, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).
>  
>  In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0  I...
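
For illustration only, here is a minimal sketch of the "compare two computations of the same digest" idea from the quoted discussion. It is not taken from the PR or from the existing jtreg tests. The class name `Sha3ChunkedFuzz` and the method `countMismatches` are hypothetical. It feeds the same random input to `MessageDigest` once in a single call and once in random-sized chunks, on the assumption that chunking changes how the data reaches the block-compression code. It only checks self-consistency: both sides may still end up in the same intrinsic, so it does not replace a comparison against a copied pure Java implementation.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper for illustration; not part of the JDK test suite.
final class Sha3ChunkedFuzz {

    // Digests random inputs one-shot and in random-sized chunks,
    // and returns how many times the two results disagree.
    static int countMismatches(String algorithm, int iterations, long seed)
            throws NoSuchAlgorithmException {
        Random rnd = new Random(seed);
        MessageDigest oneShot = MessageDigest.getInstance(algorithm);
        MessageDigest chunked = MessageDigest.getInstance(algorithm);
        int mismatches = 0;

        for (int i = 0; i < iterations; i++) {
            // Input sizes range from 0 bytes up to several SHA3 blocks.
            byte[] input = new byte[rnd.nextInt(1024)];
            rnd.nextBytes(input);

            // digest(byte[]) updates, finalizes and resets the instance.
            byte[] expected = oneShot.digest(input);

            // Feed the same bytes in chunks of random length (at least 1).
            int off = 0;
            while (off < input.length) {
                int len = 1 + rnd.nextInt(input.length - off);
                chunked.update(input, off, len);
                off += len;
            }
            byte[] actual = chunked.digest();

            if (!Arrays.equals(expected, actual)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) throws Exception {
        String[] algorithms = {"SHA3-224", "SHA3-256", "SHA3-384", "SHA3-512"};
        for (String alg : algorithms) {
            System.out.println(alg + " mismatches: " + countMismatches(alg, 10_000, 42L));
        }
    }
}
```

As noted in the quoted text, a short run may finish before the code is compiled, so a sketch like this would presumably need enough iterations, or a run with `-Xcomp -XX:-TieredCompilation`, to reach the intrinsic at all.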

@ferakocz I think you need to enable SHA3 testing in the jtreg tests we have by modifying:
https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L106

[JDK-8252204](https://bugs.openjdk.org/browse/JDK-8252204) added several C2 tests for SHA3 intrinsics in `test/hotspot/jtreg/compiler/intrinsics/sha`. Please make sure your changes pass those tests.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21352#issuecomment-2415313075