RFR: 8337666: AArch64: SHA3 GPR intrinsic

Thu Apr 24 08:22:48 UTC 2025

On Wed, 26 Mar 2025 15:55:59 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:

> This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the Java implementation algorithm but eagerly uses available registers. For example, FP+R18 are used when it's allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than the C2 compiled version; on Graviton 3 it is 8-14% faster than the C2 compiled version (which is faster than the current intrinsic); on Apple Silicon it is faster than the C2 compiled version but slower than the ARMv8.2-SHA intrinsic. Improvements on a particular CPU depend on the input length. For instance, for Graviton 2:
> 
> 
> Benchmark (ops/ms)           (digesterName)  (length)  G2
> MessageDigests.digest              SHA3-256        64  28.28%
> MessageDigests.digest              SHA3-256     16384  53.58%
> MessageDigests.digest              SHA3-512        64  27.97%
> MessageDigests.digest              SHA3-512     16384  43.90%
> MessageDigests.getAndDigest        SHA3-256        64  26.18%
> MessageDigests.getAndDigest        SHA3-256     16384  52.82%
> MessageDigests.getAndDigest        SHA3-512        64  24.73%
> MessageDigests.getAndDigest        SHA3-512     16384  44.31%
> 
> 
> (results for intermediate input lengths look like steps)
> 
> On Graviton 4 there is still a noticeable difference between the proposed implementation and C2 generated code:
> 
> 
> Benchmark                    (digesterName)  (length)  Pct
> MessageDigests.digest              SHA3-256        64     8.3%
> MessageDigests.digest              SHA3-256     16384     11%
> MessageDigests.digest              SHA3-512        64     8.4%
> MessageDigests.digest              SHA3-512     16384     11.5%
> MessageDigests.getAndDigest        SHA3-256        64     7.2%
> MessageDigests.getAndDigest        SHA3-256     16384     11%
> MessageDigests.getAndDigest        SHA3-512        64     7.3%
> MessageDigests.getAndDigest        SHA3-512     16384     11.6%
> 
> 
> and the version that uses the extension is ~1.8x slower than C2
> 
> The existing intrinsic implementation is put under a flag, `UseSIMDForSHA3Intrinsic`, which is on by default wherever the intrinsic is currently enabled.
> 
> Sanity tests were modified to cover new intrinsic variants (`-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer`) on aarch64 hw. Existing test cases where the intrinsic is enabled are executed with `-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic`; on platforms where the SHA3 extension is missing they are still cut off by the `isSHA3IntrinsicAvailable()` predicate.
> 
> The original PR https://github.com/openjdk/jdk/pull/20422 has been auto-closed and the branch has been re-created on top of the new master.

Thanks. I think we need a bit more information.

> On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version

RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it?

The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware. It would also be nice for the automation to choose the fastest, for any given hardware.
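
As a rough illustration of the kind of side-by-side measurement meant here, the following is a minimal sketch, not the JMH `MessageDigests` benchmark quoted above. The class name `Sha3Throughput` and its methods are hypothetical, and the warm-up and timing are deliberately crude; it relies only on the standard `java.security.MessageDigest` API and the two input lengths that appear in the tables.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; not part of the PR or of the JDK micro-benchmarks.
public class Sha3Throughput {

    // Hex digest, useful for checking that every configuration computes the same result.
    static String digestHex(String algorithm, byte[] input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(input)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Crude throughput estimate in digest operations per millisecond.
    static double opsPerMs(String algorithm, int length, int iterations)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] input = new byte[length];
        long sink = 0;
        // Warm-up pass so the JIT (and any intrinsic) is in place before timing.
        for (int i = 0; i < iterations; i++) {
            sink += md.digest(input)[0];
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += md.digest(input)[0];
        }
        long elapsedNs = Math.max(1L, System.nanoTime() - start);
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // keeps the loop results observably used
        }
        return iterations / (elapsedNs / 1_000_000.0);
    }

    public static void main(String[] args) throws Exception {
        for (String algorithm : new String[] {"SHA3-256", "SHA3-512"}) {
            for (int length : new int[] {64, 16384}) {
                System.out.printf("%s length=%d: %.1f ops/ms%n",
                        algorithm, length, opsPerMs(algorithm, length, 20_000));
            }
        }
    }
}
```

Running the same class once with `-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic` and once with `-XX:-UseSIMDForSHA3Intrinsic` (the flag combinations quoted from the PR description, so they only take effect on a build that includes this change) would give a first per-machine impression; JMH remains the right tool for numbers worth reporting.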

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24260#issuecomment-2826765233