RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic [v11]

Wed Aug 2 11:05:12 UTC 2023

On Wed, 21 Oct 2020 23:42:33 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> Contributed-by: ard.biesheuvel at linaro.org, dongbo4 at huawei.com
>> 
>> This adds an intrinsic for SHA3 using the AArch64 v8.2 SHA3 Crypto Extensions.
>> Reference implementation for core SHA-3 transform using ARMv8.2 Crypto Extensions:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/arm64/crypto/sha3-ce-core.S?h=v5.4.52
>> 
>> A trivial adaptation in SHA3.implCompress is needed for the purpose of adding the intrinsic.
>> For SHA3, we need to pass one extra parameter "digestLength" to the stub for the calculation of block size.
>> "digestLength" is also used in the EOR loop before keccak to distinguish between the SHA3 variants.
>> 
>> We added jtreg tests for SHA3 and used QEMU system emulator which supports SHA3 instructions to test the functionality.
>> Patch passed jtreg tier1-3 tests with QEMU system emulator.
>> Also verified with jtreg tier1-3 tests without SHA3 instructions on aarch64-linux-gnu and x86_64-linux-gnu, to make sure that there's no regression.
>> 
>> We used one existing JMH test for performance test: test/micro/org/openjdk/bench/java/security/MessageDigests.java
>> We measured the performance benefit with an aarch64 cycle-accurate simulator.
>> Patch delivers 20% - 40% performance improvement depending on specific SHA3 digest length and size of the message.
>> 
>> For now, this feature will not be enabled automatically for aarch64. We can auto-enable this when it is fully tested on real hardware. But for the above testing purposes, this is auto-enabled when the corresponding hardware feature is detected.
>
> Fei Yang has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add if (isJDK16OrHigher()) check for SHA3 in CheckGraalIntrinsics.java
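
For context on the "digestLength" parameter mentioned in the quoted description: in SHA-3 the block size (rate) follows from the digest length, because the capacity is twice the digest length and the Keccak state is 200 bytes wide. The sketch below only illustrates that arithmetic; the names kStateBytes and sha3_block_size are hypothetical and are not taken from the patch.

```cpp
#include <cstddef>

// Keccak-f[1600] state width in bytes (1600 bits / 8).
constexpr std::size_t kStateBytes = 200;

// Hypothetical helper (not from the patch): block size (rate) in bytes
// for a SHA-3 variant whose digest is digest_length bytes long.
// capacity = 2 * digest_length, rate = state width - capacity.
constexpr std::size_t sha3_block_size(std::size_t digest_length) {
    return kStateBytes - 2 * digest_length;
}

// SHA3-256 has a 32-byte digest and a 136-byte block.
static_assert(sha3_block_size(32) == 136, "SHA3-256 rate");
```

This is why a stub that receives only the digest length can still derive how many bytes of input to XOR into the state per block.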

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 3473:

> 3471:     __ bcax(v24, __ T16B, v24, v8,  v31);
> 3472: 
> 3473:     __ ld1r(v31, __ T2D, __ post(rscratch1, 8));

Is it intentional to load 16 bytes and post-increment by 8?
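
For reference while looking at this, here is a minimal C++ model of what the Arm architecture defines for ld1r with the 2D arrangement and an 8-byte post-index: it reads a single 64-bit element from memory, replicates it into both lanes of the 128-bit register, and then advances the base address by the element size. The names Vec2D and ld1r_2d_post are hypothetical and exist only for this illustration; this models the instruction, not the stub code itself.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical model of a 128-bit vector register viewed as two 64-bit lanes.
struct Vec2D {
    std::uint64_t lane[2];
};

// Models "ld1r {Vt.2D}, [Xn], #8":
// read ONE 64-bit element, copy it into both lanes, then advance addr by 8.
inline Vec2D ld1r_2d_post(const unsigned char*& addr) {
    std::uint64_t elem;
    std::memcpy(&elem, addr, sizeof elem);  // only 8 bytes are read
    addr += sizeof elem;                    // post-increment by the element size
    return Vec2D{{elem, elem}};             // replicate into both lanes
}
```

Under this model the register ends up with 16 bytes of content, but only 8 bytes are read from memory per call, so consecutive calls walk an array of 64-bit values one element at a time.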

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/207#discussion_r1281749663