RFR: 8337666: AArch64: SHA3 GPR intrinsic
Andrew Haley
aph at openjdk.org
Fri Aug 30 10:18:18 UTC 2024
On Thu, 1 Aug 2024 14:38:12 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:
> This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the Java implementation algorithm but eagerly uses available registers. For example, FP+R18 are used when it's allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version; on Graviton 3 it is 8-14% faster than C2 compiled version (which is faster than the current intrinsic); on Apple Silicon it is faster than C2 compiled version but slower than the ARMv8.2-SHA intrinsic. Improvements on a particular CPU depend on the input length. For instance, for Graviton 2:
>
>
> Benchmark (ops/ms)            (digesterName)  (length)  G2
> MessageDigests.digest         SHA3-256              64  28.28%
> MessageDigests.digest         SHA3-256           16384  53.58%
> MessageDigests.digest         SHA3-512              64  27.97%
> MessageDigests.digest         SHA3-512           16384  43.90%
> MessageDigests.getAndDigest   SHA3-256              64  26.18%
> MessageDigests.getAndDigest   SHA3-256           16384  52.82%
> MessageDigests.getAndDigest   SHA3-512              64  24.73%
> MessageDigests.getAndDigest   SHA3-512           16384  44.31%
>
>
> (results for intermediate input lengths look like steps)
>
> Existing intrinsic implementation is put under a flag `UseSIMDForSHA3Intrinsic` which is on by default where the intrinsic is enabled currently.
>
> Sanity tests were modified to cover new intrinsic variants (`-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer`) on aarch64 hw. Existing test cases where intrinsic is enabled are executed with `-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic`, on platforms where the sha3 extension is missing they still are cut off by isSHA3IntrinsicAvailable() predicate.
This is an interesting one. My thoughts:
Keccak (SHA-3) is still not used much, mostly because it's slow. It was one of the slowest finalists in the SHA-3 competition. The main reason it was chosen is that it was so different from SHA-2, and the goal was to have something ready in case SHA-2 was broken. But SHA-2 is still secure, and still standard. It will be the preferred hash algorithm for the foreseeable future.
Keccak is slow for a few reasons: software implementations are slow, hardware implementations don't really exist, parallel modes for SHA-3 (https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf, https://keccak.team/files/Sakura.pdf) are still not standardized, and SHA-3 has a truly humongous (i.e. unnecessary) safety margin.
There is hope that one day hardware implementations will become common, because Keccak is very efficient in hardware. But (I guess) manufacturers are reluctant to spend a lot of gates on this thing people don't much use.
The existing vectorized version of SHA-3 in AArch64 HotSpot depends on FEAT_SHA3, which I think is optional, so acceleration is nice to have on cores without FEAT_SHA3. But (as you say) this accelerated version offers a modest speedup over C2-compiled Java code.
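For anyone who wants to poke at this locally, here is a minimal sketch of the Java-level code path that the benchmark above exercises. It uses only the standard `java.security.MessageDigest` API with the "SHA3-256" and "SHA3-512" algorithm names. The class name `Sha3DigestSketch` and the helper `sha3Hex` are made up for illustration, and the input lengths (64 and 16384 bytes) are taken from the benchmark table. It is not the JMH benchmark itself and says nothing about which intrinsic, if any, ends up being used.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration class; not part of the PR or of the JDK benchmarks.
public class Sha3DigestSketch {

    // Hashes the input with the named algorithm and returns lowercase hex.
    public static String sha3Hex(String algorithm, byte[] input) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(input);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA3-* names are available in JDK 9 and later.
            throw new IllegalStateException("Algorithm not available: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // Input lengths mirror the (length) column of the benchmark table.
        int[] lengths = {64, 16384};
        String[] algorithms = {"SHA3-256", "SHA3-512"};
        for (String algorithm : algorithms) {
            for (int length : lengths) {
                byte[] input = new byte[length]; // all-zero input, for illustration only
                System.out.println(algorithm + " (" + length + " bytes): "
                        + sha3Hex(algorithm, input));
            }
        }
    }
}
```

On a build that includes this patch, running it with `-XX:-UseSIMDForSHA3Intrinsic` should select the GPR variant described in the PR; on other builds that flag is not recognized.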
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20422#issuecomment-2320757133