RFR: 8296548: Improve MD5 intrinsic for x86_64
Yi-Fan Tsai
duke at openjdk.org
Tue Nov 15 12:56:00 UTC 2022
On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai <duke at openjdk.org> wrote:
> The LEA instruction loads the effective address, but the MD5 intrinsic uses it for computing values rather than addresses. This usage potentially takes more cycles than ADDs and reduces throughput.
>
> This change replaces
> LEA: r1 = r1 + rsi * 1 + t
> with
> ADDs: r1 += t; r1 += rsi.
>
> Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.
>
> No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.
>
> Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.
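
For readers following the thread, the replacement quoted above can be checked for value equivalence with a small sketch. This is not the stub generator code from the PR; the class and method names below are hypothetical, and Java `int` arithmetic (which wraps modulo 2^32) stands in for the 32-bit registers. The constant 0xd76aa478 is the first MD5 round constant, used here only as a sample immediate. The sketch says nothing about cycle counts, only that both forms produce the same 32-bit result.

```java
import java.util.Random;

public class LeaVsAdd {
    // Models the three-operand LEA form: r1 = r1 + rsi * 1 + t.
    static int leaForm(int r1, int rsi, int t) {
        return r1 + rsi * 1 + t;
    }

    // Models the two-ADD form: r1 += t; r1 += rsi.
    static int addForm(int r1, int rsi, int t) {
        r1 += t;
        r1 += rsi;
        return r1;
    }

    public static void main(String[] args) {
        final int t = 0xd76aa478; // sample immediate (first MD5 round constant)
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int r1 = rnd.nextInt();
            int rsi = rnd.nextInt();
            if (leaForm(r1, rsi, t) != addForm(r1, rsi, t)) {
                throw new AssertionError("mismatch for r1=" + r1 + ", rsi=" + rsi);
            }
        }
        System.out.println("LEA form and ADD form agree on all sampled inputs");
    }
}
```

One difference the sketch cannot show: on x86, ADD updates the flags register while LEA does not, so the substitution is only safe where no later instruction depends on flags set before it.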
Performance without the optimization on Cascade Lake:
Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score      Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3315.328 ±   65.799  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    27.482 ±    0.006  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2916.207 ±  127.293  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    27.381 ±    0.003  ops/ms
Performance with the optimization on Cascade Lake:
Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score      Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  4474.780 ±   17.583  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    38.926 ±    0.005  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3796.684 ±  153.887  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    38.724 ±    0.005  ops/ms
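
As a quick cross-check of the Cascade Lake numbers above, the relative throughput gain per benchmark can be computed as (after / before - 1) * 100. The sketch below uses only the Score columns from the two tables; the class and method names are hypothetical, and the error margins are ignored, which matters most for the 64-byte getAndDigest rows where the reported error is comparatively large.

```java
import java.util.Locale;

public class Md5Speedup {
    // Relative throughput gain in percent, given ops/ms before and after.
    static double percentGain(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {
            "digest, length 64",
            "digest, length 16384",
            "getAndDigest, length 64",
            "getAndDigest, length 16384",
        };
        // Score values (ops/ms) copied from the tables above.
        double[] before = {3315.328, 27.482, 2916.207, 27.381};
        double[] after  = {4474.780, 38.926, 3796.684, 38.724};

        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-28s %6.1f%%%n",
                    names[i], percentGain(before[i], after[i]));
        }
    }
}
```

On these scores the gains come out at roughly 30% to 42%, with the 16384-byte cases at the upper end, in line with the ~40% figure quoted for this microbenchmark.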
-------------
PR: https://git.openjdk.org/jdk/pull/11054