RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics

Thu Nov 3 11:50:50 UTC 2022

On Wed, 2 Nov 2022 09:21:57 GMT, Hao Sun <haosun at openjdk.org> wrote:

>> In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`.
>> However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`:
>> 
>> 	digest_length	block_size
>> SHA3-224	28	144
>> SHA3-256	32	136
>> SHA3-384	48	104
>> SHA3-512	64	72
>> SHAKE128	variable	168
>> SHAKE256	variable	136
>> 
>> 
>> This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`.
>> 
>> The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances.
>> Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed.
>> And tier1~3 passed on SHA3 supported hardware.
>> 
>> The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform.
>> The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively.
>> 
>> Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform.
>> Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression.
>> These performance details are available in the comments of the issue page.
>> I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform.
>
> Overall, it looks good to me. (I'm not a Reviewer)
> 
> I agree with your fix to pass `block_size` rather than `digest_length`, but I didn't fully understand your update insides `stubGenerator_aarch64.cpp`, i.e. those operations before `rounds24_loop`, since I'm not a crypto expert.
> We may need some crypo expert to review that part.
> 
> Note that I rerun tier1~3 with the latest commit in this PR on sha512 feature supported hardware, and the test passed.
> 
> Thanks.

@shqking Thanks for the review.

Before the `NR (==24)` iterations in `keccak`, `a0~a24` need to be ready:

// SHA3.java
    private void keccak() {
        // convert the 200-byte state into 25 lanes
        bytes2Lanes(state, lanes);
        ...
        a0 = lanes[0]; a1 = lanes[1]; ...; a24 = lanes[24];

        // process the lanes through step mappings
        for (int ir = 0; ir < NR; ir++) {
            long c0 = a0^a5^a10^a15^a20;
            long c1 = a1^a6^a11^a16^a21;
            ...

While `a_i` is computed as: `state[i] ^= b[ofs++]` in `implCompress0(byte[] b, int ofs)`, `byte2Lanes(state, lanes)`, `a_i = lanes[i]`.

// SHA3.java
    @IntrinsicCandidate
    private void implCompress0(byte[] b, int ofs) {
       for (int i = 0; i < buffer.length; i++) {
           state[i] ^= b[ofs++];
       }
       keccak();
    }

Those operations before `rounds24_loop` can be taken as vectorial version of the Java code shown above:
1. Load all elements of `state` array into vector registers (v0~v24).
2. Load the input data into `v25~v31`. The number of data to load is dependent on `block_size (== buffer.length)`.
3. Perform `state[i] ^ b[ofs]`.
We don't need instructions for `byte2Lanes`, cause it is already done with the `ld1_8B` instructions.

Hope this can address your confusion.

-------------

PR: https://git.openjdk.org/jdk/pull/10939