RFR: 8308465: Reduce memory reads in AArch64 MD5 intrinsic

Yi-Fan Tsai duke at openjdk.org
Sat May 20 07:35:52 UTC 2023


This change implements two optimizations that reduce memory reads in the AArch64 MD5 intrinsic.

**Optimization 1:** The memory loads and stores that update the hash values are moved out of the loop, so the final results are written to memory only once.

In the original snippet, each iteration reloads the value at the top of the loop (step 3) soon after the previous iteration wrote it to memory (step 2).

md5_loop:
    __ ldrw(a, Address(state, 0));         // step 3: load the value from memory
    ... // loop body
    __ ldrw(rscratch1, Address(state, 0)); // step 1: load the value at Address(state, 0)
    __ addw(rscratch1, rscratch1, a);
    __ strw(rscratch1, Address(state, 0)); // step 2: write the value to memory
    ...
    __ br(Assembler::LE, md5_loop);


The optimized snippet keeps the value in a register, so the loop no longer loads it from or stores it to memory.

    __ ldp(s0, s1, Address(state, 0));     // load the state once; the low 32 bits of s0 hold the value at Address(state, 0)
    __ ubfx(a, s0, 0, 32);
md5_loop:
    ... // loop body
    __ ubfx(rscratch1, s0, 0, 32);         // step 1: extract the value from the register
    __ addw(a, rscratch1, a);
    __ orr(s0, a, b, Assembler::LSL, 32);  // step 2: keep the updated value in the register (s0 = a | b << 32)
    ...
    __ br(Assembler::LE, md5_loop);
    ...
    __ str(s0, Address(state, 0));         // write the result to memory only once
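
The register packing above can be modeled in plain C++ to check that both loop shapes produce the same state. This is a minimal sketch, not the intrinsic itself: `ubfx` and `orr_lsl` model the two AArch64 instructions, while `mix`, `State`, `kSampleBlocks`, and the two `update_via_*` functions are hypothetical names introduced only for illustration, and `mix` is an arbitrary stand-in for the real MD5 rounds.

```cpp
#include <cstdint>

// Model of AArch64 UBFX: zero-extended field of `width` bits starting at bit `lsb`.
constexpr uint64_t ubfx(uint64_t src, unsigned lsb, unsigned width) {
    return (src >> lsb) & (width >= 64 ? ~0ULL : (1ULL << width) - 1);
}

// Model of ORR with an LSL-shifted second operand: lo | (hi << shift).
constexpr uint64_t orr_lsl(uint64_t lo, uint64_t hi, unsigned shift) {
    return lo | (hi << shift);
}

// Hypothetical stand-in for the MD5 rounds (NOT the real round function).
constexpr uint32_t mix(uint32_t v, uint32_t block_word) {
    return v * 31u + block_word;
}

// Two 32-bit hash words, standing in for the words at Address(state, 0) and Address(state, 4).
struct State { uint32_t a; uint32_t b; };

// Arbitrary sample input, one word per "block" (assumption for the demo).
constexpr uint32_t kSampleBlocks[3] = {0x01234567u, 0x89abcdefu, 0xdeadbeefu};

// Original shape: the state is read from and written back to memory for every block.
constexpr State update_via_memory(State mem, const uint32_t* blocks, int n) {
    for (int i = 0; i < n; i++) {
        uint32_t a = mix(mem.a, blocks[i]);      // load the state, run the rounds
        uint32_t b = mix(mem.b ^ a, blocks[i]);
        mem.a = mem.a + a;                       // load, add, store
        mem.b = mem.b + b;
    }
    return mem;
}

// Optimized shape: the state lives in one packed 64-bit value and is unpacked once.
constexpr State update_via_register(State mem, const uint32_t* blocks, int n) {
    uint64_t s0 = orr_lsl(mem.a, mem.b, 32);     // models the single load before the loop
    uint32_t a = static_cast<uint32_t>(ubfx(s0, 0, 32));
    uint32_t b = static_cast<uint32_t>(ubfx(s0, 32, 32));
    for (int i = 0; i < n; i++) {
        a = mix(a, blocks[i]);                   // rounds operate on registers only
        b = mix(b ^ a, blocks[i]);
        a += static_cast<uint32_t>(ubfx(s0, 0, 32));   // step 1: extract the old value, then add
        b += static_cast<uint32_t>(ubfx(s0, 32, 32));
        s0 = orr_lsl(a, b, 32);                  // step 2: keep the updated state in the register
    }
    // Models the single store after the loop.
    return State{static_cast<uint32_t>(ubfx(s0, 0, 32)),
                 static_cast<uint32_t>(ubfx(s0, 32, 32))};
}
```

Calling both functions with the same initial `State` and `kSampleBlocks` should yield identical `a` and `b` fields; only the number of modeled memory accesses differs.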


**Optimization 2**: Redundant loads generated by `md5_GG`, `md5_HH`, and `md5_II` are removed.

The original snippet, generated by two `md5_FF` calls and two `md5_GG` calls, reads the same data repeatedly.

__ ldrw(rscratch1, Address(buf, 0));    // from md5_FF(.., k = 0, ..)
...
__ ldrw(rscratch1, Address(buf, 4));    // from md5_FF(.., k = 1, ..)
...
__ ldrw(rscratch1, Address(buf, 4));    // from md5_GG(.., k = 1, ..)
...
__ ldrw(rscratch1, Address(buf, 0));    // from md5_GG(.., k = 0, ..)


The snippet is optimized by caching the values in registers and removing the redundant loads.

__ ldp(buf0, buf1, Address(buf, 0));   // one load; buf0 holds the values for k = 0 and k = 1
...
__ ubfx(rscratch1, buf0, 0, 32);       // extract the value of k = 0 from the lower 32 bits of buf0
...
__ ubfx(rscratch1, buf0, 32, 32);      // extract the value of k = 1 from the higher 32 bits of buf0
...
__ ubfx(rscratch1, buf0, 32, 32);      // k = 1 again: reuse buf0, no reload
...
__ ubfx(rscratch1, buf0, 0, 32);       // k = 0 again: reuse buf0, no reload
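
The caching scheme can be checked with a small C++ model that compares a dedicated 32-bit load per word against extraction from a cached 64-bit value. This is a sketch under the assumption of a little-endian target; `load32_le`, `load64_le`, `word_via_load`, `word_via_cache`, and `kSampleBuf` are hypothetical names used only here, and the loads are modeled byte by byte so the result does not depend on the host.

```cpp
#include <cstdint>

// Model of AArch64 UBFX: zero-extended field of `width` bits starting at bit `lsb`.
constexpr uint64_t ubfx(uint64_t src, unsigned lsb, unsigned width) {
    return (src >> lsb) & (width >= 64 ? ~0ULL : (1ULL << width) - 1);
}

// Little-endian 32-bit load, modeling one ldrw.
constexpr uint32_t load32_le(const uint8_t* p) {
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

// Little-endian 64-bit load, modeling one half of an ldp of two 64-bit registers.
constexpr uint64_t load64_le(const uint8_t* p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

// Arbitrary 16-byte sample input (assumption for the demo).
constexpr uint8_t kSampleBuf[16] = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
                                    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};

// Original shape: every use of word k performs its own memory load.
constexpr uint32_t word_via_load(const uint8_t* buf, int k) {
    return load32_le(buf + 4 * k);
}

// Optimized shape: word k comes from the cached 64-bit value that covers it.
// Even k sits in the lower 32 bits, odd k in the upper 32 bits.
constexpr uint32_t word_via_cache(uint64_t cached, int k) {
    return static_cast<uint32_t>(ubfx(cached, (k & 1) * 32, 32));
}
```

With `buf0 = load64_le(kSampleBuf)` and `buf1 = load64_le(kSampleBuf + 8)`, `word_via_cache(buf0, k)` for k = 0, 1 and `word_via_cache(buf1, k)` for k = 2, 3 should match `word_via_load(kSampleBuf, k)`, while memory is touched only twice.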



**Test**
The following tests have passed.

test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5Intrinsics.java
test/hotspot/jtreg/compiler/intrinsics/sha/sanity/TestMD5MultiBlockIntrinsics.java


**Performance**
Throughput improves by roughly 1-2% with `micro:org.openjdk.bench.java.security.MessageDigests` on inputs of 1,024 bytes and larger. Results at 256 bytes are mixed, and 64-byte inputs are slightly slower.

*MessageDigests.digest* improvement
| Input size (bytes) | 64     | 256    | 1,024 | 4,096 | 16,384 |
|--------------------|--------|--------|-------|-------|--------|
| Graviton 2         | -1.41% | 0.43%  | 1.81% | 2.20% | 2.28%  |
| Graviton 3         | -3.63% | -0.43% | 0.73% | 1.05% | 1.14%  |

*MessageDigests.getAndDigest* improvement
| Input size (bytes) | 64     | 256   | 1,024 | 4,096 | 16,384 |
|--------------------|--------|-------|-------|-------|--------|
| Graviton 2         | -0.97% | 0.55% | 1.46% | 1.84% | 1.91%  |
| Graviton 3         | -0.20% | 0.49% | 1.03% | 1.13% | 1.17%  |

Graviton 2

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3709.849 ± 30.327  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1513.543 ±  0.616  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   462.135 ±  0.382  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   122.360 ±  0.024  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.037 ±  0.010  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2902.714 ± 92.908  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1395.815 ±  2.292  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   448.729 ±  7.343  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   120.616 ±  0.038  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.010 ±  0.007  ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3657.658 ± 40.255  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1520.086 ±  6.095  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   470.505 ±  0.395  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   125.048 ±  0.044  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.744 ±  0.050  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2874.460 ± 95.028  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1403.462 ±  4.536  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   455.260 ±  6.794  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   122.836 ±  0.046  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.602 ±  0.024  ops/ms
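
As a cross-check, the percentages in the improvement tables can be reproduced from the baseline and optimized scores above. This is a minimal C++ sketch, assuming the improvement is the relative change in throughput; the function name `improvement_percent` is illustrative and not part of the patch or the benchmark.

```cpp
// Relative throughput change in percent.
// Positive means the optimized build completes more ops/ms than the baseline.
constexpr double improvement_percent(double baseline_ops_per_ms, double optimized_ops_per_ms) {
    return (optimized_ops_per_ms / baseline_ops_per_ms - 1.0) * 100.0;
}
```

For example, `improvement_percent(3709.849, 3657.658)` gives about -1.41 and `improvement_percent(31.037, 31.744)` gives about 2.28, matching the 64-byte and 16,384-byte Graviton 2 entries for `MessageDigests.digest`.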


Graviton 3

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
---- baseline ------------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  4122.050 ±  8.495  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1634.045 ±  0.341  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   490.091 ±  0.072  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   129.017 ±  0.007  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    32.687 ±  0.002  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3212.170 ± 81.253  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1504.159 ±  1.091  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   476.164 ±  3.869  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   126.983 ±  0.011  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    32.546 ±  0.004  ops/ms
---- optimized -----------------------------------------------------------------------------------------
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3972.523 ±  8.753  ops/ms
MessageDigests.digest                   md5       256     DEFAULT  thrpt   15  1627.038 ±  1.855  ops/ms
MessageDigests.digest                   md5      1024     DEFAULT  thrpt   15   493.648 ±  0.064  ops/ms
MessageDigests.digest                   md5      4096     DEFAULT  thrpt   15   130.371 ±  0.012  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    33.058 ±  0.002  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3205.779 ± 76.897  ops/ms
MessageDigests.getAndDigest             md5       256     DEFAULT  thrpt   15  1511.463 ±  2.209  ops/ms
MessageDigests.getAndDigest             md5      1024     DEFAULT  thrpt   15   481.071 ±  3.479  ops/ms
MessageDigests.getAndDigest             md5      4096     DEFAULT  thrpt   15   128.423 ±  0.015  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    32.928 ±  0.005  ops/ms

-------------

Commit messages:
 - 8308465: Reduce memory reads in AArch64 MD5 intrinsic

Changes: https://git.openjdk.org/jdk/pull/14068/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14068&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8308465
  Stats: 139 lines in 1 file changed: 45 ins; 8 del; 86 mod
  Patch: https://git.openjdk.org/jdk/pull/14068.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14068/head:pull/14068

PR: https://git.openjdk.org/jdk/pull/14068


More information about the hotspot-compiler-dev mailing list