RFR: 8337666: AArch64: SHA3 GPR intrinsic
Andrew Haley
aph at openjdk.org
Thu Apr 24 08:22:48 UTC 2025
On Wed, 26 Mar 2025 15:55:59 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:
> This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the Java implementation algorithm but eagerly uses available registers. For example, FP+R18 are used when it's allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version; on Graviton 3 it is 8-14% faster than C2 compiled version (which is faster than the current intrinsic); on Apple Silicon it is faster than C2 compiled version but slower than the ARMv8.2-SHA intrinsic. Improvements on a particular CPU depend on the input length. For instance, for Graviton 2:
>
>
> Benchmark (ops/ms)            (digesterName)  (length)      G2
> MessageDigests.digest         SHA3-256              64  28.28%
> MessageDigests.digest         SHA3-256           16384  53.58%
> MessageDigests.digest         SHA3-512              64  27.97%
> MessageDigests.digest         SHA3-512           16384  43.90%
> MessageDigests.getAndDigest   SHA3-256              64  26.18%
> MessageDigests.getAndDigest   SHA3-256           16384  52.82%
> MessageDigests.getAndDigest   SHA3-512              64  24.73%
> MessageDigests.getAndDigest   SHA3-512           16384  44.31%
>
>
> (results for intermediate input lengths look like steps)
>
> On Graviton 4 there is still a noticeable difference between the proposed implementation and C2 generated code:
>
>
> Benchmark                     (digesterName)  (length)     Pct
> MessageDigests.digest         SHA3-256              64    8.3%
> MessageDigests.digest         SHA3-256           16384     11%
> MessageDigests.digest         SHA3-512              64    8.4%
> MessageDigests.digest         SHA3-512           16384   11.5%
> MessageDigests.getAndDigest   SHA3-256              64    7.2%
> MessageDigests.getAndDigest   SHA3-256           16384     11%
> MessageDigests.getAndDigest   SHA3-512              64    7.3%
> MessageDigests.getAndDigest   SHA3-512           16384   11.6%
>
>
> and the version that uses the extension is ~1.8x slower than C2
>
> The existing intrinsic implementation is put under a flag, `UseSIMDForSHA3Intrinsic`, which is on by default where the intrinsic is currently enabled.
>
> Sanity tests were modified to cover new intrinsic variants (`-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer`) on aarch64 hardware. Existing test cases where the intrinsic is enabled are executed with `-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic`; on platforms where the SHA3 extension is missing, they are still cut off by the isSHA3IntrinsicAvailable() predicate.
>
> The original PR https://github.com/openjdk/jdk/pull/20422 has been auto-closed and the branch has been re-created on top of the new master.
Thanks. I think we need a bit more information.

> On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version

RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it?

The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware. It would also be nice for the automation to choose the fastest, for any given hardware.

-------------
PR Comment: https://git.openjdk.org/jdk/pull/24260#issuecomment-2826765233