RFR: 8337666: AArch64: SHA3 GPR intrinsic

Fri Aug 30 10:18:18 UTC 2024

On Thu, 1 Aug 2024 14:38:12 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:

> This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the Java implementation algorithm but eagerly uses available registers. For example, FP+R18 are used when it's allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than the C2-compiled version; on Graviton 3 it is 8-14% faster than the C2-compiled version (which is faster than the current intrinsic); on Apple Silicon it is faster than the C2-compiled version but slower than the ARMv8.2-SHA intrinsic. Improvements on a particular CPU depend on the input length. For instance, for Graviton 2:
> 
> 
> Benchmark (ops/ms)	(digesterName)	(length)	G2
> MessageDigests.digest	SHA3-256	64	28.28%
> MessageDigests.digest	SHA3-256	16384	53.58%
> MessageDigests.digest	SHA3-512	64	27.97%
> MessageDigests.digest	SHA3-512	16384	43.90%
> MessageDigests.getAndDigest	SHA3-256	64	26.18%
> MessageDigests.getAndDigest	SHA3-256	16384	52.82%
> MessageDigests.getAndDigest	SHA3-512	64	24.73%
> MessageDigests.getAndDigest	SHA3-512	16384	44.31%
> 
> 
> (results for intermediate input lengths look like steps)
> 
> The existing intrinsic implementation is put under a flag, `UseSIMDForSHA3Intrinsic`, which is on by default wherever the intrinsic is currently enabled.
> 
> Sanity tests were modified to cover the new intrinsic variants (`-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer`) on aarch64 hardware. Existing test cases where the intrinsic is enabled are executed with `-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic`; on platforms where the SHA3 extension is missing they are still cut off by the `isSHA3IntrinsicAvailable()` predicate.

This is an interesting one. My thoughts:

Keccak (SHA-3) is still not used much, mostly because it's slow. It was one of the slowest finalists in the SHA-3 competition. The main reason it was chosen is that it was so different from SHA-2, and the goal was to have something ready in case SHA-2 was broken. But SHA-2 is still secure, and still standard. It will be the preferred hash algorithm for the foreseeable future.

Keccak is slow for a few reasons: software implementations are slow, hardware implementations don't really exist, parallel modes for SHA-3 (https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf, https://keccak.team/files/Sakura.pdf) are still not standardized, and SHA-3 has a truly humongous (i.e. unnecessary) safety margin.

There is hope that one day hardware implementations will become common, because Keccak is very efficient in hardware. But (I guess) manufacturers are reluctant to spend a lot of gates on this thing people don't much use.

The existing vectorized version of SHA-3 in AArch64 HotSpot depends on FEAT_SHA3, which I think is optional, so acceleration is nice to have on cores without FEAT_SHA3. But (as you say) this accelerated version offers a modest speedup over C2-compiled Java code.
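
For anyone who wants to poke at this locally, here is a minimal sketch of the Java-level path under discussion: plain `java.security.MessageDigest` calls for SHA3-256 and SHA3-512 at the two input lengths from the table above. The class name `Sha3DigestSketch` and its helper methods are made up for illustration, and this is not the JMH benchmark that produced the quoted numbers. Whether an intrinsic is actually used depends on the CPU, the JIT warm-up, and the VM flags (e.g. `-XX:-UseSIMDForSHA3Intrinsic` as described in the PR).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only.
public class Sha3DigestSketch {

    // Digest an all-zero input of the given length with the named algorithm.
    static byte[] digest(String algorithm, int length) throws NoSuchAlgorithmException {
        byte[] input = new byte[length];
        return MessageDigest.getInstance(algorithm).digest(input);
    }

    // Lower-case hex encoding of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same algorithm names and lengths as in the table above.
        for (String algorithm : new String[] {"SHA3-256", "SHA3-512"}) {
            for (int length : new int[] {64, 16384}) {
                byte[] out = digest(algorithm, length);
                System.out.println(algorithm + " len=" + length + " -> " + hex(out));
            }
        }
    }
}
```

The digests must be identical with and without the intrinsic, so comparing the printed values across flag settings is a cheap sanity check; it says nothing about speed.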

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20422#issuecomment-2320757133