Integrated: 8308465: Reduce memory accesses in AArch64 MD5 intrinsic

Yi-Fan Tsai duke at openjdk.org
Mon May 22 16:56:58 UTC 2023


On Sat, 20 May 2023 07:29:13 GMT, Yi-Fan Tsai <duke at openjdk.org> wrote:

> This change implements two optimizations that reduce memory accesses in the AArch64 MD5 intrinsic.
> 
> **Optimization 1:** The loads and stores that update the hash values are moved out of the loop, so the final results are written to memory only once.
> 
> In the original snippet, each loop iteration reloaded the value (step 3) right after the previous iteration had stored it to memory (step 2).
> 
> md5_loop:
>     __ ldrw(a, Address(state, 0));         // step 3: load the value from memory
>     ... // loop body
>     __ ldrw(rscratch1, Address(state, 0)); // step 1: load the value at Address(state, 0)
>     __ addw(rscratch1, rscratch1, a);
>     __ strw(rscratch1, Address(state, 0)); // step 2: write the value to memory
>     ...
>     __ br(Assembler::LE, md5_loop);
> 
> 
> The optimized snippet keeps the value in a register, avoiding loads and stores inside the loop.
> 
>     __ ldp(s0, s1, Address(state,  0));    // load the value at Address(state, 0) to a register
>     __ ubfx(a, s0, 0, 32);
> md5_loop:
>     ... // loop body
>     __ ubfx(rscratch1, s0, 0, 32);         // step 1: extract the value from the register
>     __ addw(a, rscratch1, a);
>     __ orr(s0, a, b, Assembler::LSL, 32);  // step 2: preserve the value in the register
>     ...
>     __ br(Assembler::LE, md5_loop);
>     ...
>     __ str(s0, Address(state, 0));         // write the result to memory only once
> 
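The register packing above can be modeled in plain C++. This is only a sketch under stated assumptions: `ubfx` and `pack` are simplified stand-ins for the AArch64 `ubfx` and `orr ... LSL #32` instructions, `fake_rounds` is a hypothetical placeholder for the real MD5 rounds, and `b` is held constant here although the intrinsic updates it as well. It illustrates that the per-iteration load/add/store and the register-resident version compute the same value.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of AArch64 UBFX: extract `width` bits starting at `lsb`.
constexpr uint64_t ubfx(uint64_t src, unsigned lsb, unsigned width) {
    return (src >> lsb) & (width == 64 ? ~0ULL : ((1ULL << width) - 1));
}

// Simplified model of `orr(dst, lo, hi, LSL, 32)`: pack two 32-bit words.
constexpr uint64_t pack(uint32_t lo, uint32_t hi) {
    return static_cast<uint64_t>(lo) | (static_cast<uint64_t>(hi) << 32);
}

// Hypothetical placeholder for the MD5 rounds; not the real algorithm.
constexpr uint32_t fake_rounds(uint32_t a, uint32_t block) {
    return (a ^ block) * 2654435761u + 1u;
}

// Original shape: the state word goes through memory on every iteration.
inline uint32_t per_block_memory(uint32_t state0, const uint32_t* blocks, int n) {
    uint32_t mem = state0;                  // stands in for Address(state, 0)
    for (int i = 0; i < n; ++i) {
        uint32_t a = mem;                   // step 3: ldrw a
        a = fake_rounds(a, blocks[i]);      // loop body
        uint32_t t = mem;                   // step 1: ldrw rscratch1
        mem = t + a;                        // step 2: addw + strw
    }
    return mem;
}

// Optimized shape: the state word stays packed in a 64-bit register.
inline uint32_t register_resident(uint32_t state0, uint32_t state1,
                                  const uint32_t* blocks, int n) {
    uint64_t s0 = pack(state0, state1);     // ldp (lower half of the pair)
    uint32_t a = static_cast<uint32_t>(ubfx(s0, 0, 32));
    const uint32_t b = static_cast<uint32_t>(ubfx(s0, 32, 32)); // constant in this sketch
    for (int i = 0; i < n; ++i) {
        a = fake_rounds(a, blocks[i]);      // loop body
        uint32_t t = static_cast<uint32_t>(ubfx(s0, 0, 32)); // step 1
        a = t + a;                          // addw(a, rscratch1, a)
        s0 = pack(a, b);                    // step 2: orr(s0, a, b, LSL, 32)
    }
    return static_cast<uint32_t>(ubfx(s0, 0, 32)); // a single str would follow
}
```

Calling both functions with the same initial word and the same block array should return the same result, which is the property the optimization relies on.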
> 
> **Optimization 2:** Redundant loads generated by `md5_GG`, `md5_HH`, and `md5_II` are removed.
> 
> The original snippet, generated by two `md5_FF` calls and two `md5_GG` calls, read the same data repeatedly.
> 
> __ ldrw(rscratch1, Address(buf, 0));    // from md5_FF(.., k = 0, ..)
> ...
> __ ldrw(rscratch1, Address(buf, 4));    // from md5_FF(.., k = 1, ..)
> ...
> __ ldrw(rscratch1, Address(buf, 4));    // from md5_GG(.., k = 1, ..)
> ...
> __ ldrw(rscratch1, Address(buf, 0));    // from md5_GG(.., k = 0, ..)
> 
> 
> The optimized snippet caches the values in registers, which removes the redundant loads.
> 
> __ ldp(buf0, buf1, Address(buf, 0));   // load 16 bytes; buf0 holds the values for k = 0 and k = 1
> ...
> __ ubfx(rscratch1, buf0, 0, 32);       // extract the value of k = 0 from the lower 32 bits of buf0
> ...
> __ ubfx(rscratch1, buf0, 32, 32);      // extract the value of k = 1 from the higher 32 bits of buf0
> ...
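> __ ubfx(rscratch1, buf0, 32, 32);      // md5_GG, k = 1: reuse the cached value, no reload
> ...
> __ ubfx(rscratch1, buf0, 0, 32);       // md5_GG, k = 0: reuse the cached value, no reload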
> 
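The caching scheme can be sketched the same way. The model below assumes, for illustration only, that the whole 64-byte block is held in eight 64-bit values (the excerpt above shows only `buf0` and `buf1`); `CachedBlock`, `cache_block`, and `cached_word` are hypothetical names. Bytes are assembled in little-endian order explicitly, so the model does not depend on the host's byte order.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of AArch64 UBFX: extract `width` bits starting at `lsb`.
constexpr uint64_t ubfx(uint64_t src, unsigned lsb, unsigned width) {
    return (src >> lsb) & (width == 64 ? ~0ULL : ((1ULL << width) - 1));
}

// Original shape: one 32-bit little-endian load per use, like ldrw.
inline uint32_t load_le32(const unsigned char* p) {
    uint32_t w = 0;
    for (int i = 0; i < 4; ++i) w |= static_cast<uint32_t>(p[i]) << (8 * i);
    return w;
}

// One 64-bit little-endian load; an ldp performs two of these at once.
inline uint64_t load_le64(const unsigned char* p) {
    uint64_t w = 0;
    for (int i = 0; i < 8; ++i) w |= static_cast<uint64_t>(p[i]) << (8 * i);
    return w;
}

// Hypothetical register file: eight 64-bit values covering the 64-byte block.
struct CachedBlock {
    uint64_t regs[8];
};

// Read the block from memory once.
inline CachedBlock cache_block(const unsigned char* buf) {
    CachedBlock c{};
    for (int r = 0; r < 8; ++r) c.regs[r] = load_le64(buf + 8 * r);
    return c;
}

// Word k lives in register k / 2: the lower half if k is even, the upper if odd.
inline uint32_t cached_word(const CachedBlock& c, int k) {
    return static_cast<uint32_t>(ubfx(c.regs[k / 2], 32 * (k % 2), 32));
}
```

With this layout, every later use of word `k` by `md5_GG`, `md5_HH`, or `md5_II` becomes a bit-field extract from a register instead of a fresh load, for example `cached_word(c, 1)` in place of `load_le32(buf + 4)`.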
> 
> 
> **Test**
> The following tests have passed.
> 
> test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5Intrinsics.java
> test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5MultiBlockIntrinsics.java
> 
> 
> **Performance*...

This pull request has now been integrated.

Changeset: 8474e693
Author:    Yi-Fan Tsai <yftsai at amazon.com>
Committer: Paul Hohensee <phh at openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/8474e693b4404ba62927fe0e43e68b904d66fbde
Stats:     138 lines in 1 file changed: 44 ins; 11 del; 83 mod

8308465: Reduce memory accesses in AArch64 MD5 intrinsic

Reviewed-by: aph, phh

-------------

PR: https://git.openjdk.org/jdk/pull/14068

