RFR: 8308465: Reduce memory reads in AArch64 MD5 intrinsic [v2]
Yi-Fan Tsai
duke at openjdk.org
Sat May 20 20:11:52 UTC 2023
> This change implements two optimizations that reduce memory reads in the AArch64 MD5 intrinsic.
>
> **Optimization 1:** Memory loads and stores updating hash values are moved out of the loop. The final results are only written to memory once.
>
> The original snippet reloads the value at the top of each iteration (step 3), soon after the previous iteration wrote it to memory (step 2).
>
> md5_loop:
> __ ldrw(a, Address(state, 0)); // step 3: load the value from memory
> ... // loop body
> __ ldrw(rscratch1, Address(state, 0)); // step 1: load the value at Address(state, 0)
> __ addw(rscratch1, rscratch1, a);
> __ strw(rscratch1, Address(state, 0)); // step 2: write the value to memory
> ...
> __ br(Assembler::LE, md5_loop);
>
>
> The optimized snippet keeps the value in a register, so the loop contains no loads or stores of the hash state.
>
> __ ldp(s0, s1, Address(state, 0)); // load the value at Address(state, 0) to a register
> __ ubfx(a, s0, 0, 32);
> md5_loop:
> ... // loop body
> __ ubfx(rscratch1, s0, 0, 32); // step 1: extract the value from the register
> __ addw(a, rscratch1, a);
> __ orr(s0, a, b, Assembler::LSL, 32); // step 2: preserve the value in the register
> ...
> __ br(Assembler::LE, md5_loop);
> ...
> __ str(s0, Address(state, 0)); // write the result to memory only once
>
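For illustration only, and not the code in the patch: a minimal C++ model of the register-resident pattern above. `ubfx64` and `pack_lo_hi` are hypothetical helpers standing in for the `ubfx` and `orr(..., Assembler::LSL, 32)` instructions, and `fake_rounds` is a made-up placeholder for the real MD5 round functions. The sketch assumes only two state words so that a single 64-bit value holds them both.

```cpp
#include <cstdint>

// Model of AArch64 `ubfx dst, src, lsb, width` (unsigned bit-field extract).
static inline uint64_t ubfx64(uint64_t src, unsigned lsb, unsigned width) {
    const uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
    return (src >> lsb) & mask;
}

// Model of `orr dst, lo, hi, LSL #32`. The 32-bit type of `lo` guarantees its
// upper half is zero, as it is after a 32-bit `addw` on AArch64.
static inline uint64_t pack_lo_hi(uint32_t lo, uint32_t hi) {
    return static_cast<uint64_t>(lo) | (static_cast<uint64_t>(hi) << 32);
}

// Hypothetical stand-in for the MD5 rounds; NOT the real algorithm.
static void fake_rounds(uint32_t& a, uint32_t& b, uint32_t block) {
    a = a * 31u + block;
    b ^= a;
}

// Original shape: the state is loaded and stored on every iteration.
void update_in_memory(uint32_t state[2], const uint32_t* blocks, int n) {
    for (int i = 0; i < n; i++) {
        uint32_t a = state[0];            // step 3: load
        uint32_t b = state[1];
        fake_rounds(a, b, blocks[i]);
        state[0] += a;                    // steps 1 and 2: load, add, store
        state[1] += b;
    }
}

// Optimized shape: one load before the loop, one store after it.
void update_in_register(uint32_t state[2], const uint32_t* blocks, int n) {
    uint64_t s0 = pack_lo_hi(state[0], state[1]);
    for (int i = 0; i < n; i++) {
        uint32_t a = static_cast<uint32_t>(ubfx64(s0, 0, 32));
        uint32_t b = static_cast<uint32_t>(ubfx64(s0, 32, 32));
        fake_rounds(a, b, blocks[i]);
        a += static_cast<uint32_t>(ubfx64(s0, 0, 32));   // step 1: extract and add
        b += static_cast<uint32_t>(ubfx64(s0, 32, 32));
        s0 = pack_lo_hi(a, b);                           // step 2: keep in register
    }
    state[0] = static_cast<uint32_t>(ubfx64(s0, 0, 32)); // single write-back
    state[1] = static_cast<uint32_t>(ubfx64(s0, 32, 32));
}
```

Both functions should produce the same final state for any input; the second one touches `state` only outside the loop.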
>
> **Optimization 2**: Redundant loads generated by `md5_GG`, `md5_HH`, and `md5_II` are removed.
>
> The original snippet, generated by two `md5_FF` calls and two `md5_GG` calls, shows that the same data was read repeatedly.
>
> __ ldrw(rscratch1, Address(buf, 0)); // from md5_FF(.., k = 0, ..)
> ...
> __ ldrw(rscratch1, Address(buf, 4)); // from md5_FF(.., k = 1, ..)
> ...
> __ ldrw(rscratch1, Address(buf, 4)); // from md5_GG(.., k = 1, ..)
> ...
> __ ldrw(rscratch1, Address(buf, 0)); // from md5_GG(.., k = 0, ..)
>
>
> The snippet is optimized by caching the values in registers and removing the redundant loads.
>
> __ ldp(buf0, buf1, Address(buf, 0)); // one load; the k = 0 and k = 1 values both land in buf0
> ...
> __ ubfx(rscratch1, buf0, 0, 32); // extract the value of k = 0 from the lower 32 bits of buf0
> ...
> __ ubfx(rscratch1, buf0, 32, 32); // extract the value of k = 1 from the higher 32 bits of buf0
> ...
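> __ ubfx(rscratch1, buf0, 32, 32); // md5_GG(.., k = 1, ..): reuse buf0 instead of reloading
> ...
> __ ubfx(rscratch1, buf0, 0, 32); // md5_GG(.., k = 0, ..): reuse buf0 instead of reloading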
>
>
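For illustration only: a small C++ sketch of why one 64-bit little-endian load can serve two 32-bit message words. It assumes the whole 64-byte block is cached in eight 64-bit values; the snippet above shows only `buf0` and `buf1`, so that layout, the `CachedBlock` type, and every function name here are hypothetical.

```cpp
#include <cstdint>

// Read 4 bytes as a little-endian 32-bit word (what `ldrw` yields on AArch64).
static uint32_t load_le32(const unsigned char* p) {
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

// Read 8 bytes as a little-endian 64-bit word (one half of an `ldp`).
static uint64_t load_le64(const unsigned char* p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

// Hypothetical cache: eight 64-bit "registers" covering one 64-byte block.
struct CachedBlock {
    uint64_t reg[8];
};

// Load the block once, up front.
static CachedBlock cache_block(const unsigned char* buf) {
    CachedBlock c;
    for (int i = 0; i < 8; i++) c.reg[i] = load_le64(buf + 8 * i);
    return c;
}

// Word k lives in register k / 2: even k in the low half, odd k in the high
// half. This mirrors `ubfx(dst, bufN, 0, 32)` and `ubfx(dst, bufN, 32, 32)`.
static uint32_t cached_word(const CachedBlock& c, int k) {
    return static_cast<uint32_t>(c.reg[k / 2] >> (32 * (k % 2)));
}
```

With this layout, `cached_word(c, k)` returns the same value as a fresh `load_le32(buf + 4 * k)`, so later rounds can reuse the cached data without reading memory again.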
>
> **Test**
> The following tests have passed.
>
> test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5Intrinsics.java
> test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5MultiBlockIntrinsics.java
>
>
> **Per...
Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision:
Rename and optimize
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/14068/files
- new: https://git.openjdk.org/jdk/pull/14068/files/c9ae28a1..0fcb9d42
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=14068&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=14068&range=00-01
Stats: 22 lines in 1 file changed: 2 ins; 6 del; 14 mod
Patch: https://git.openjdk.org/jdk/pull/14068.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/14068/head:pull/14068
PR: https://git.openjdk.org/jdk/pull/14068
More information about the hotspot-compiler-dev mailing list