RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

Wed Dec 15 14:25:56 UTC 2021

On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich <bulasevich at openjdk.org> wrote:

> Background
> 
> The goal is to improve SHA3 implementation performance as it runs up to two times slower than native (OpenSSL, measured on AMD64 and AArch6464) implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and most AArch64) do not.
> 
> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was implemented for ARMv8.2 with SHA3 instructions support:
> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: Implement SHA3 accelerator/intrinsic
> 
> SHA3 java implementation have already been reworked to eliminate memory consumption and make it work faster:
> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash algorithm performance improvements (~12x speedup)
> The latter issue addressed this previously, but there is still some room to improve SHA3 performance with current C2 implementation, which is proposed in this PR.
> 
> Option 1 (this PR)
> 
> With this change I unroll the internal cycles manually, inline it manually, and use local variables (not array) for data processing. Such an approach gives the best performance (see benchmark results). Without this change (with current array data processing) we observe a lot of load/store operations in comparison to processing in local variables, both on AMD64 and on AArch64.  
> Native implementations shows that on AArch64 (32 GPU registers) SHA-3 algorithm can hold all 25 data and all temporary variables in registers. C2 can't optimize it as well because many regisers are allocated for internal usage: rscratch1, rscratch2, rmethod, rthread, etc.  With data in local variables the number of excessive load/stores is much smaller and performance result is much better.
> 
> Option 2 (alternative)
> 
> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db)
> 
> This is a more conservative change which minimizes code changes, but it has lower performance improvement. Please let me know if you think this change is better then main one: in this case I will replace it within this Pull Request.
> 
> With this change I unroll the internal cycles manually and use @Forceinline annotation. Manual unrolling is necessary because C2 does not recognize there is a counted cycle that can be completely unrolled. Instead of replacing the loop with five loop bodies C2 splits the loop to pre- main- and post- loop which is not good for this case. C2 works better when the array is created locally, but in our case the array is created on object instantiation, so C2 can't prove the array length is constant. The second issue affecting performance is that methods with unrolled loops get larger and C2 does not inline them. It's addressed here by using @Forceinline annotation.
> 
> Benchmark results
> 
> We measured the change on four platforms: ARM32, PPC, AArch64 (with no advanced SHA-3 instructions) and AMD on
>   [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) benchmark. Here is the result: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on ARM32/PPC/AArch64/AMD64
> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on ARM32/PPC/AArch64/AMD64
> 
> Testing
> 
> Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 implementation over the same vectors before and after the change, and checking the results are the same.
> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
> The test vectors compiled from the [Final Algorithm Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project):
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt

@kuksenko You used to improve this code before, could you share your opinion?

-------------

PR: https://git.openjdk.java.net/jdk/pull/6847