RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics
Dong Bo
dongbo at openjdk.org
Thu Nov 3 11:50:50 UTC 2022
On Wed, 2 Nov 2022 09:21:57 GMT, Hao Sun <haosun at openjdk.org> wrote:
>> In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`.
>> However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`:
>>
>> digest_length block_size
>> SHA3-224 28 144
>> SHA3-256 32 136
>> SHA3-384 48 104
>> SHA3-512 64 72
>> SHAKE128 variable 168
>> SHAKE256 variable 136
>>
>>
>> This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`.
>>
>> The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances.
>> Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed.
>> And tier1~3 passed on SHA3 supported hardware.
>>
>> The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform.
>> The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively.
>>
>> Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform.
>> Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression.
>> These performance details are available in the comments of the issue page.
>> I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform.
>
> Overall, it looks good to me. (I'm not a Reviewer)
>
> I agree with your fix to pass `block_size` rather than `digest_length`, but I didn't fully understand your update insides `stubGenerator_aarch64.cpp`, i.e. those operations before `rounds24_loop`, since I'm not a crypto expert.
> We may need some crypo expert to review that part.
>
> Note that I rerun tier1~3 with the latest commit in this PR on sha512 feature supported hardware, and the test passed.
>
> Thanks.
@shqking Thanks for the review.
Before the `NR (==24)` iterations in `keccak`, `a0~a24` need to be ready:
// SHA3.java
private void keccak() {
// convert the 200-byte state into 25 lanes
bytes2Lanes(state, lanes);
...
a0 = lanes[0]; a1 = lanes[1]; ...; a24 = lanes[24];
// process the lanes through step mappings
for (int ir = 0; ir < NR; ir++) {
long c0 = a0^a5^a10^a15^a20;
long c1 = a1^a6^a11^a16^a21;
...
While `a_i` is computed as: `state[i] ^= b[ofs++]` in `implCompress0(byte[] b, int ofs)`, `byte2Lanes(state, lanes)`, `a_i = lanes[i]`.
// SHA3.java
@IntrinsicCandidate
private void implCompress0(byte[] b, int ofs) {
for (int i = 0; i < buffer.length; i++) {
state[i] ^= b[ofs++];
}
keccak();
}
Those operations before `rounds24_loop` can be taken as vectorial version of the Java code shown above:
1. Load all elements of `state` array into vector registers (v0~v24).
2. Load the input data into `v25~v31`. The number of data to load is dependent on `block_size (== buffer.length)`.
3. Perform `state[i] ^ b[ofs]`.
We don't need instructions for `byte2Lanes`, cause it is already done with the `ld1_8B` instructions.
Hope this can address your confusion.
-------------
PR: https://git.openjdk.org/jdk/pull/10939
More information about the hotspot-compiler-dev
mailing list