RFR: 8337666: AArch64: SHA3 GPR intrinsic [v8]

Andrew Haley aph at openjdk.org
Thu Jun 5 12:28:54 UTC 2025


On Thu, 5 Jun 2025 12:11:40 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:

>> This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the algorithm of the Java implementation but eagerly uses available registers; for example, FP and R18 are used when that is allowed. On simpler cores such as the RPi3 or Surface Pro it is 23-53% faster than the C2-compiled version; on Graviton 3 it is 8-14% faster than the C2-compiled version (which is itself faster than the current intrinsic); on Apple Silicon it is faster than the C2-compiled version but slower than the ARMv8.2-SHA intrinsic. The improvement on a particular CPU depends on the input length. For instance, for Graviton 2:
>> 
>> 
>> G2
>> Benchmark                    (digesterName)  (length)        Pct
>> MessageDigests.digest              SHA3-256        64     28.28%
>> MessageDigests.digest              SHA3-256     16384     53.58%
>> MessageDigests.digest              SHA3-512        64     27.97%
>> MessageDigests.digest              SHA3-512     16384     43.90%
>> MessageDigests.getAndDigest        SHA3-256        64     26.18%
>> MessageDigests.getAndDigest        SHA3-256     16384     52.82%
>> MessageDigests.getAndDigest        SHA3-512        64     24.73%
>> MessageDigests.getAndDigest        SHA3-512     16384     44.31%
>> 
>> 
>> (results for intermediate input lengths look like steps)
>> 
>> On Graviton 4 there is still a noticeable difference between the proposed implementation and C2 generated code:
>> 
>> 
>> G4
>> Benchmark                    (digesterName)  (length)        Pct
>> MessageDigests.digest              SHA3-256        64     8.3%
>> MessageDigests.digest              SHA3-256     16384     11%
>> MessageDigests.digest              SHA3-512        64     8.4%
>> MessageDigests.digest              SHA3-512     16384     11.5%
>> MessageDigests.getAndDigest        SHA3-256        64     7.2%
>> MessageDigests.getAndDigest        SHA3-256     16384     11%
>> MessageDigests.getAndDigest        SHA3-512        64     7.3%
>> MessageDigests.getAndDigest        SHA3-512     16384     11.6%
>> 
>> 
>> and the version that uses the SHA3 extension is ~1.8x slower than the C2-generated code.
>> 
>> The existing intrinsic implementation is placed under the flag `UseSIMDForSHA3Intrinsic`, which is on by default wherever the intrinsic is currently enabled.
>> 
>> Sanity tests were modified to cover the new intrinsic variants (`-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer`) on AArch64 hardware. Existing test cases where the intrinsic is enabled are executed with `-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic`, on platforms where the sha3 extension ...
>
> Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
> 
>  - Merge branch 'openjdk:master' into JDK-8337666
>  - No imm masking in rolw
>  - Merge branch 'openjdk:master' into JDK-8337666
>  - Update src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp
>    
>    Co-authored-by: Andrew Haley <aph-open at littlepinkcloud.com>
>  - Update src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp
>    
>    Co-authored-by: Andrew Haley <aph-open at littlepinkcloud.com>
>  - Merge branch 'openjdk:master' into JDK-8337666
>  - Assert message
>  - Copyright year
>  - Review suggestions
>  - Merge master
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/782bbca4...37bda3c2

OK, thanks.

-------------

Marked as reviewed by aph (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/24260#pullrequestreview-2900117365

