RFR: 8308465: Reduce memory reads in AArch64 MD5 intrinsic
Yi-Fan Tsai
duke at openjdk.org
Sat May 20 07:35:52 UTC 2023
This change implements two optimizations that reduce memory reads in the AArch64 MD5 intrinsic.
**Optimization 1:** The loads and stores that update the hash state are moved out of the loop, so the final result is written to memory only once.
In the original code, each loop iteration reloads the value (step 3) right after the previous iteration wrote it to memory (step 2).
md5_loop:
__ ldrw(a, Address(state, 0)); // step 3: load the value from memory
... // loop body
__ ldrw(rscratch1, Address(state, 0)); // step 1: load the value at Address(state, 0)
__ addw(rscratch1, rscratch1, a);
__ strw(rscratch1, Address(state, 0)); // step 2: write the value to memory
...
__ br(Assembler::LE, md5_loop);
The optimized code keeps the state in registers, so the loop no longer loads or stores it.
__ ldp(s0, s1, Address(state, 0)); // load the state into registers once; s0 holds the two 32-bit words at offsets 0 and 4
__ ubfx(a, s0, 0, 32);
md5_loop:
... // loop body
__ ubfx(rscratch1, s0, 0, 32); // step 1: extract the value from the register
__ addw(a, rscratch1, a);
__ orr(s0, a, b, Assembler::LSL, 32); // step 2: pack a and b back into s0 instead of storing to memory
...
__ br(Assembler::LE, md5_loop);
...
__ str(s0, Address(state, 0)); // write the result to memory only once
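The register packing used above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `ubfx`, `addw` and `orr_lsl` imitate the AArch64 instructions of the same name on Python integers, `body` is a hypothetical stand-in for the MD5 steps (it is not the real round function), and the load counters only illustrate the idea rather than the instruction counts of the actual patch.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def ubfx(reg, lsb, width):
    """Unsigned bit-field extract: bits [lsb, lsb + width) of reg."""
    return (reg >> lsb) & ((1 << width) - 1)

def addw(x, y):
    """32-bit wrapping add."""
    return (x + y) & MASK32

def orr_lsl(rn, rm, shift):
    """Model of orr(rd, rn, rm, LSL, shift) on 64-bit registers."""
    return (rn | (rm << shift)) & MASK64

def body(a, b, word):
    """Hypothetical stand-in for the MD5 steps; NOT the real round function."""
    a = addw(a ^ b, word)
    b = addw(b, ((a << 7) | (a >> 25)) & MASK32)
    return a, b

def run_memory(state, words):
    """Original shape: the state is reloaded and stored in every iteration."""
    mem = list(state)
    loads = 0
    for w in words:
        a, b = mem[0], mem[1]          # step 3: reload what was just stored
        loads += 2
        a, b = body(a, b, w)
        mem[0] = addw(mem[0], a)       # steps 1 and 2: load, add, store
        mem[1] = addw(mem[1], b)
        loads += 2
    return mem, loads

def run_registers(state, words):
    """Optimized shape: the state lives in one packed 64-bit value, s0."""
    s0 = orr_lsl(state[0], state[1], 32)   # models the single load before the loop
    loads = 1
    a, b = ubfx(s0, 0, 32), ubfx(s0, 32, 32)
    for w in words:
        a, b = body(a, b, w)
        a = addw(ubfx(s0, 0, 32), a)       # step 1: extract from the register
        b = addw(ubfx(s0, 32, 32), b)
        s0 = orr_lsl(a, b, 32)             # step 2: keep the result in the register
    return [ubfx(s0, 0, 32), ubfx(s0, 32, 32)], loads

if __name__ == "__main__":
    print(run_memory([0x67452301, 0xEFCDAB89], [1, 2, 3]))
    print(run_registers([0x67452301, 0xEFCDAB89], [1, 2, 3]))
```

Both functions return the same state words; only the number of modeled loads differs, which is the point of the optimization.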
**Optimization 2:** Redundant loads generated by `md5_GG`, `md5_HH`, and `md5_II` are removed.
The original code below, generated by two `md5_FF` calls and two `md5_GG` calls, reads the same data from memory repeatedly.
__ ldrw(rscratch1, Address(buf, 0)); // from md5_FF(.., k = 0, ..)
...
__ ldrw(rscratch1, Address(buf, 4)); // from md5_FF(.., k = 1, ..)
...
__ ldrw(rscratch1, Address(buf, 4)); // from md5_GG(.., k = 1, ..)
...
__ ldrw(rscratch1, Address(buf, 0)); // from md5_GG(.., k = 0, ..)
The optimized code caches the values in registers and extracts them from there, which removes the redundant loads.
__ ldp(buf0, buf1, Address(buf, 0)); // one load; the words for k = 0 and k = 1 both land in buf0
...
__ ubfx(rscratch1, buf0, 0, 32); // md5_FF(.., k = 0, ..): extract the lower 32 bits of buf0
...
__ ubfx(rscratch1, buf0, 32, 32); // md5_FF(.., k = 1, ..): extract the upper 32 bits of buf0
...
__ ubfx(rscratch1, buf0, 32, 32); // md5_GG(.., k = 1, ..): reuse buf0, no reload
...
__ ubfx(rscratch1, buf0, 0, 32); // md5_GG(.., k = 0, ..): reuse buf0, no reload
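To see why the loads are redundant, recall that each of the four MD5 rounds reads all 16 message words of a block, only in a different order (the index formulas below are the ones from RFC 1321). The following Python sketch assumes the whole block is cached as eight 64-bit values with the even-numbered word in the lower half; `load_pairs`, `word_from_regs` and `k_index` are hypothetical helper names, and the snippets above only show the first pair, so the full layout is an assumption.

```python
def ubfx(reg, lsb, width):
    """Unsigned bit-field extract: bits [lsb, lsb + width) of reg."""
    return (reg >> lsb) & ((1 << width) - 1)

def k_index(round_no, i):
    """Message-word index used by step i (0..15) of round 0..3 (FF, GG, HH, II)."""
    return [i, (5 * i + 1) % 16, (3 * i + 5) % 16, (7 * i) % 16][round_no]

def load_pairs(buf_words):
    """Model eight 64-bit loads, each holding two adjacent 32-bit words."""
    return [buf_words[2 * j] | (buf_words[2 * j + 1] << 32) for j in range(8)]

def word_from_regs(regs, k):
    """Extract word k from the cached registers instead of reading memory."""
    return ubfx(regs[k // 2], 32 * (k % 2), 32)

if __name__ == "__main__":
    block = [0x01000000 * n + n for n in range(16)]   # arbitrary test words
    regs = load_pairs(block)
    reads = 0
    for r in range(4):
        for i in range(16):
            k = k_index(r, i)
            assert word_from_regs(regs, k) == block[k]
            reads += 1
    print(f"{reads} word reads served from {len(regs)} cached registers")
```

Under this model, the 64 word reads of one block touch only 16 distinct words, so reading each word from memory once is sufficient.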
**Test**
The following tests pass with this change:
test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5Intrinsics.java
test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5MultiBlockIntrinsics.java
**Performance**
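With `micro:org.openjdk.bench.java.security.MessageDigests`, throughput improves by roughly 1-2% on inputs of 1,024 bytes and larger (0.73% to 2.28% across both machines). At 64 bytes the measured throughput is 0.20% to 3.63% lower, and at 256 bytes the change is within ±0.55%.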
*MessageDigests.digest* improvement
| Input size (bytes) | 64 | 256 | 1,024 | 4,096 | 16,384 |
|--------------------|--------|--------|-------|-------|--------|
| Graviton 2 | -1.41% | 0.43% | 1.81% | 2.20% | 2.28% |
| Graviton 3 | -3.63% | -0.43% | 0.73% | 1.05% | 1.14% |
*MessageDigests.getAndDigest* improvement
| Input size (bytes) | 64 | 256 | 1,024 | 4,096 | 16,384 |
|--------------------|--------|-------|-------|-------|--------|
| Graviton 2 | -0.97% | 0.55% | 1.46% | 1.84% | 1.91% |
| Graviton 3 | -0.20% | 0.49% | 1.03% | 1.13% | 1.17% |
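The percentages in these tables follow from the raw JMH scores listed below as `optimized / baseline - 1`. As a small worked example, the Python sketch below recomputes the Graviton 2 `MessageDigests.digest` row; the function name `improvement_percent` is chosen here for illustration, and the calculation ignores the reported error margins.

```python
def improvement_percent(baseline, optimized):
    """Relative throughput change in percent; positive means faster."""
    return (optimized / baseline - 1.0) * 100.0

# Graviton 2, MessageDigests.digest, md5, scores in ops/ms (from the raw output)
baseline = {64: 3709.849, 256: 1513.543, 1024: 462.135, 4096: 122.360, 16384: 31.037}
optimized = {64: 3657.658, 256: 1520.086, 1024: 470.505, 4096: 125.048, 16384: 31.744}

improvements = {n: improvement_percent(baseline[n], optimized[n]) for n in baseline}

if __name__ == "__main__":
    for n, pct in improvements.items():
        print(f"{n:>6} bytes: {pct:+.2f}%")
```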
Graviton 2
Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest md5 64 DEFAULT thrpt 15 3709.849 ± 30.327 ops/ms
MessageDigests.digest md5 256 DEFAULT thrpt 15 1513.543 ± 0.616 ops/ms
MessageDigests.digest md5 1024 DEFAULT thrpt 15 462.135 ± 0.382 ops/ms
MessageDigests.digest md5 4096 DEFAULT thrpt 15 122.360 ± 0.024 ops/ms
MessageDigests.digest md5 16384 DEFAULT thrpt 15 31.037 ± 0.010 ops/ms
MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 2902.714 ± 92.908 ops/ms
MessageDigests.getAndDigest md5 256 DEFAULT thrpt 15 1395.815 ± 2.292 ops/ms
MessageDigests.getAndDigest md5 1024 DEFAULT thrpt 15 448.729 ± 7.343 ops/ms
MessageDigests.getAndDigest md5 4096 DEFAULT thrpt 15 120.616 ± 0.038 ops/ms
MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 31.010 ± 0.007 ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest md5 64 DEFAULT thrpt 15 3657.658 ± 40.255 ops/ms
MessageDigests.digest md5 256 DEFAULT thrpt 15 1520.086 ± 6.095 ops/ms
MessageDigests.digest md5 1024 DEFAULT thrpt 15 470.505 ± 0.395 ops/ms
MessageDigests.digest md5 4096 DEFAULT thrpt 15 125.048 ± 0.044 ops/ms
MessageDigests.digest md5 16384 DEFAULT thrpt 15 31.744 ± 0.050 ops/ms
MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 2874.460 ± 95.028 ops/ms
MessageDigests.getAndDigest md5 256 DEFAULT thrpt 15 1403.462 ± 4.536 ops/ms
MessageDigests.getAndDigest md5 1024 DEFAULT thrpt 15 455.260 ± 6.794 ops/ms
MessageDigests.getAndDigest md5 4096 DEFAULT thrpt 15 122.836 ± 0.046 ops/ms
MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 31.602 ± 0.024 ops/ms
Graviton 3
Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest md5 64 DEFAULT thrpt 15 4122.050 ± 8.495 ops/ms
MessageDigests.digest md5 256 DEFAULT thrpt 15 1634.045 ± 0.341 ops/ms
MessageDigests.digest md5 1024 DEFAULT thrpt 15 490.091 ± 0.072 ops/ms
MessageDigests.digest md5 4096 DEFAULT thrpt 15 129.017 ± 0.007 ops/ms
MessageDigests.digest md5 16384 DEFAULT thrpt 15 32.687 ± 0.002 ops/ms
MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 3212.170 ± 81.253 ops/ms
MessageDigests.getAndDigest md5 256 DEFAULT thrpt 15 1504.159 ± 1.091 ops/ms
MessageDigests.getAndDigest md5 1024 DEFAULT thrpt 15 476.164 ± 3.869 ops/ms
MessageDigests.getAndDigest md5 4096 DEFAULT thrpt 15 126.983 ± 0.011 ops/ms
MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 32.546 ± 0.004 ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest md5 64 DEFAULT thrpt 15 3972.523 ± 8.753 ops/ms
MessageDigests.digest md5 256 DEFAULT thrpt 15 1627.038 ± 1.855 ops/ms
MessageDigests.digest md5 1024 DEFAULT thrpt 15 493.648 ± 0.064 ops/ms
MessageDigests.digest md5 4096 DEFAULT thrpt 15 130.371 ± 0.012 ops/ms
MessageDigests.digest md5 16384 DEFAULT thrpt 15 33.058 ± 0.002 ops/ms
MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 3205.779 ± 76.897 ops/ms
MessageDigests.getAndDigest md5 256 DEFAULT thrpt 15 1511.463 ± 2.209 ops/ms
MessageDigests.getAndDigest md5 1024 DEFAULT thrpt 15 481.071 ± 3.479 ops/ms
MessageDigests.getAndDigest md5 4096 DEFAULT thrpt 15 128.423 ± 0.015 ops/ms
MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 32.928 ± 0.005 ops/ms
-------------
Commit messages:
- 8308465: Reduce memory reads in AArch64 MD5 intrinsic
Changes: https://git.openjdk.org/jdk/pull/14068/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14068&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8308465
Stats: 139 lines in 1 file changed: 45 ins; 8 del; 86 mod
Patch: https://git.openjdk.org/jdk/pull/14068.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/14068/head:pull/14068
PR: https://git.openjdk.org/jdk/pull/14068