RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

Boris Ulasevich bulasevich at openjdk.java.net
Wed Dec 15 09:29:22 UTC 2021


Background

The goal is to improve SHA3 implementation performance as it runs up to two times slower than native (OpenSSL, measured on AMD64 and AArch6464) implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and most ARM64) do not.

For ARM64 hardware that does support SHA3 accelerators, an intrinsic was implemented for ARMv8.2 with SHA3 instructions support:
- [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: Implement SHA3 accelerator/intrinsic

SHA3 java implementation have already been reworked to eliminate memory consumption and make it work faster:
- [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash algorithm performance improvements (~12x speedup)
The latter issue addressed this previously, but there is still some room to improve SHA3 performance with current C2 implementation, which is proposed in this PR.

Option 1 (this PR)

With this change I unroll the internal cycles manually, inline it manually, and use local variables (not array) for data processing. Such an approach gives the best performance (see benchmark results). Without this change (with current array data processing) we observe a lot of load/store operations in comparison to processing in local variables, both on AMD64 and on AArch64.  
Native implementations shows that on AArch64 (32 GPU registers) SHA-3 algorithm can hold all 25 data and all temporary variables in registers. C2 can't optimize it as well because many regisers are allocated for internal usage: rscratch1, rscratch2, rmethod, rthread, etc.  With data in local variables the number of excessive load/stores is much smaller and performance result is much better.

Option 2 (alternative)

LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db)

This is a more conservative change which minimizes code changes, but it has lower performance improvement. Please let me know if you think this change is better then main one: in this case I will replace it within this Pull Request.

With this change I unroll the internal cycles manually and use @Forceinline annotation. Manual unrolling is necessary because C2 does not recognize there is a counted cycle that can be completely unrolled. Instead of replacing the loop with five loop bodies C2 splits the loop to pre- main- and post- loop which is not good for this case. C2 works better when the array is created locally, but in our case the array is created on object instantiation, so C2 can't prove the array length is constant. The second issue affecting performance is that methods with unrolled loops get larger and C2 does not inline them. It's addressed here by using @Forceinline annotation.

Benchmark results

We measured the change on four platforms: ARM32, PPC, ARM64 (with no advanced SHA-3 instructions) and AMD on
  [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) benchmark. Here is the result: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
- fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on ARM32/PPC/AArch64/AMD64
- fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on ARM32/PPC/AArch64/AMD64

Testing

Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 implementation over the same vectors before and after the change, and checking the results are the same.
The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
The test vectors compiled from the [Final Algorithm Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project):
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt

-------------

Commit messages:
 - Apply manual inline and unroll, plus moving data into local variables to help C2 with SHA3 impl

Changes: https://git.openjdk.java.net/jdk/pull/6847/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8275914
  Stats: 208 lines in 1 file changed: 114 ins; 93 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6847.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847

PR: https://git.openjdk.java.net/jdk/pull/6847


More information about the security-dev mailing list