RFR: 8259498: Reduce overhead of MD5 and SHA digests
- The MD5 intrinsics added by [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that the `int[] x` isn't actually needed. This also applies to the SHA intrinsics, from which the MD5 intrinsic takes inspiration.
- Using VarHandles we can simplify the code in `ByteArrayAccess` enough to make it acceptable to use inline and replace the array in MD5 wholesale. This improves performance both in the presence and in the absence of the intrinsic optimization.
- Doing the exact same thing in the SHA impls would be unwieldy (64+ element arrays), but allocating the array lazily gets most of the speed-up in the presence of an intrinsic while being neutral in its absence.

Baseline:

Benchmark                    (digesterName)  (length)  Cnt     Score     Error   Units
MessageDigests.digest        MD5                   16   15  2714.307 ±  21.133  ops/ms
MessageDigests.digest        MD5                 1024   15   318.087 ±   0.637  ops/ms
MessageDigests.digest        SHA-1                 16   15  1387.266 ±  40.932  ops/ms
MessageDigests.digest        SHA-1               1024   15   109.273 ±   0.149  ops/ms
MessageDigests.digest        SHA-256               16   15   995.566 ±  21.186  ops/ms
MessageDigests.digest        SHA-256             1024   15    89.104 ±   0.079  ops/ms
MessageDigests.digest        SHA-512               16   15   803.030 ±  15.722  ops/ms
MessageDigests.digest        SHA-512             1024   15   115.611 ±   0.234  ops/ms
MessageDigests.getAndDigest  MD5                   16   15  2190.367 ±  97.037  ops/ms
MessageDigests.getAndDigest  MD5                 1024   15   302.903 ±   1.809  ops/ms
MessageDigests.getAndDigest  SHA-1                 16   15  1262.656 ±  43.751  ops/ms
MessageDigests.getAndDigest  SHA-1               1024   15   104.889 ±   3.554  ops/ms
MessageDigests.getAndDigest  SHA-256               16   15   914.541 ±  55.621  ops/ms
MessageDigests.getAndDigest  SHA-256             1024   15    85.708 ±   1.394  ops/ms
MessageDigests.getAndDigest  SHA-512               16   15   737.719 ±  53.671  ops/ms
MessageDigests.getAndDigest  SHA-512             1024   15   112.307 ±   1.950  ops/ms

GC:

Benchmark                                        (digesterName)  (length)  Cnt     Score   Error  Units
MessageDigests.getAndDigest:·gc.alloc.rate.norm  MD5                   16   15   312.011 ± 0.005   B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-1                 16   15   584.020 ± 0.006   B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256               16   15   544.019 ± 0.016   B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512               16   15  1056.037 ± 0.003   B/op

Target:

Benchmark                    (digesterName)  (length)  Cnt     Score     Error   Units
MessageDigests.digest        MD5                   16   15  3134.462 ±  43.685  ops/ms
MessageDigests.digest        MD5                 1024   15   323.667 ±   0.633  ops/ms
MessageDigests.digest        SHA-1                 16   15  1418.742 ±  38.223  ops/ms
MessageDigests.digest        SHA-1               1024   15   110.178 ±   0.788  ops/ms
MessageDigests.digest        SHA-256               16   15  1037.949 ±  21.214  ops/ms
MessageDigests.digest        SHA-256             1024   15    89.671 ±   0.228  ops/ms
MessageDigests.digest        SHA-512               16   15   812.028 ±  39.489  ops/ms
MessageDigests.digest        SHA-512             1024   15   116.738 ±   0.249  ops/ms
MessageDigests.getAndDigest  MD5                   16   15  2314.379 ± 229.294  ops/ms
MessageDigests.getAndDigest  MD5                 1024   15   307.835 ±   5.730  ops/ms
MessageDigests.getAndDigest  SHA-1                 16   15  1326.887 ±  63.263  ops/ms
MessageDigests.getAndDigest  SHA-1               1024   15   106.611 ±   2.292  ops/ms
MessageDigests.getAndDigest  SHA-256               16   15   961.589 ±  82.052  ops/ms
MessageDigests.getAndDigest  SHA-256             1024   15    88.646 ±   0.194  ops/ms
MessageDigests.getAndDigest  SHA-512               16   15   775.417 ±  56.775  ops/ms
MessageDigests.getAndDigest  SHA-512             1024   15   112.904 ±   2.014  ops/ms

GC:

Benchmark                                        (digesterName)  (length)  Cnt    Score   Error  Units
MessageDigests.getAndDigest:·gc.alloc.rate.norm  MD5                   16   15  232.009 ± 0.006   B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-1                 16   15  584.021 ± 0.001   B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256               16   15  272.012 ± 0.015   B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512               16   15  400.017 ± 0.019   B/op

For the `digest` micro, digesting small inputs is faster with all algorithms, ranging from ~1% for SHA-512 up to ~15% for MD5. The gain stems from not allocating and reading into a temporary buffer outside of the intrinsic. SHA-1 does not see a statistically significant gain because the intrinsic is disabled by default on my HW.

For the `getAndDigest` micro - which tests `MessageDigest.getInstance(..).digest(..)` - there are similar gains with this patch. The interesting aspect here is verifying the reduction in allocations per operation when there's an active intrinsic (again, not for SHA-1). JDK-8259065 (#1933) reduced allocations on each of these by 144B/op, which means allocation pressure for SHA-512 is down two thirds, from 1200B/op to 400B/op, in this contrived test.

I've verified there are no regressions in the absence of the intrinsic - which the SHA-1 numbers here help show.

-------------

Commit messages:
 - Remove unused Unsafe import
 - Harmonize MD4 impl, remove now-redundant checks from ByteArrayAccess (VHs do bounds checks, most of which will be optimized away)
 - Merge branch 'master' into improve_md5
 - Apply allocation avoiding optimizations to all SHA versions sharing structural similarities with MD5
 - Remove unused reverseBytes imports
 - Copyrights
 - Fix copy-paste error
 - Various fixes (IDE stopped IDEing..)
 - Add imports
 - mismatched parens
 - ... and 8 more: https://git.openjdk.java.net/jdk/compare/090bd3af...e1c943c5

Changes: https://git.openjdk.java.net/jdk/pull/1855/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8259498
  Stats: 649 lines in 8 files changed: 83 ins; 344 del; 222 mod
  Patch: https://git.openjdk.java.net/jdk/pull/1855.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/1855/head:pull/1855

PR: https://git.openjdk.java.net/jdk/pull/1855
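[Editor's sketch] The `ByteArrayAccess` rewrite described above replaces `Unsafe`-based access with byte-array view VarHandles. A minimal standalone illustration of that idiom (class name and structure here are illustrative, not the actual JDK code) for a little-endian `int` load from a `byte[]`:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleAccessSketch {
    // A little-endian int view over a byte[]; the same idiom the patch
    // applies in ByteArrayAccess.
    private static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int b2iLittle(byte[] in, int ofs) {
        // The explicit cast is required: VarHandle.get is signature-polymorphic.
        return (int) INT_LE.get(in, ofs);
    }

    public static void main(String[] args) {
        byte[] buf = {0x78, 0x56, 0x34, 0x12};
        // Little-endian: the low byte comes first, so this reads 0x12345678
        System.out.println(Integer.toHexString(b2iLittle(buf, 0)));
    }
}
```

Unlike `Unsafe`, the view VarHandle bounds-checks every access, which is what makes it acceptable to use inline without explicit precondition checks.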
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad <redestad@openjdk.org> wrote:
Since `java.util.UUID` and `sun.security.provider.MD5` are both in `java.base`, would it make sense to create new instances by calling `new MD5()` instead of `java.security.MessageDigest.getInstance("MD5")`, bypassing the whole MessageDigest logic?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
On Tue, 5 Jan 2021 21:51:51 GMT, DellCliff <github.com+14116124+DellCliff@openjdk.org> wrote:
Since `java.util.UUID` and `sun.security.provider.MD5` are both in `java.base`, would it make sense to create new instances by calling `new MD5()` instead of `java.security.MessageDigest.getInstance("MD5")`, bypassing the whole MessageDigest logic?
Are you sure you're not ending up paying more by using a VarHandle, having to cast, and making a varargs call `(long) LONG_ARRAY_HANDLE.get(buf, ofs);`, instead of creating a ByteBuffer once via `ByteBuffer.wrap(buffer).order(ByteOrder.nativeOrder()).asLongBuffer()`?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
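[Editor's sketch] The one-time `ByteBuffer` view setup suggested here can be sketched as follows — a standalone illustration, using `LITTLE_ENDIAN` instead of `nativeOrder()` so the result is deterministic on any platform:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferViewSketch {
    // Wrap the byte[] once and read longs through the view, as an
    // alternative to per-element VarHandle access.
    static long firstLong(byte[] buffer) {
        return ByteBuffer.wrap(buffer)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asLongBuffer()
                .get(0);
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[16];
        buffer[0] = 0x2A; // low byte of the first little-endian long
        System.out.println(firstLong(buffer)); // prints 42
    }
}
```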
On Tue, 5 Jan 2021 23:08:43 GMT, DellCliff <github.com+14116124+DellCliff@openjdk.org> wrote:
Since `java.util.UUID` and `sun.security.provider.MD5` are both in `java.base`, would it make sense to create new instances by calling `new MD5()` instead of `java.security.MessageDigest.getInstance("MD5")` and bypassing the whole MessageDigest logic?
Are you sure you're not ending up paying more using a VarHandle and having to cast and using a var args call `(long) LONG_ARRAY_HANDLE.get(buf, ofs);` instead of creating a ByteBuffer once via `ByteBuffer.wrap(buffer).order(ByteOrder.nativeOrder()).asLongBuffer()`?
Hitting up `new MD5()` directly could be a great idea. I expect this would be just as fast as the cache+clone (if not faster), but I'm a bit worried we'd be short-circuiting the ability to install an alternative MD5 provider (which may or may not be a thing we must support..), but it's worth exploring.

Comparing performance of this against a `ByteBuffer` impl is on my TODO. The `VarHandle` gets heavily inlined and optimized here, though, with performance in my tests similar to the `Unsafe` use in `ByteArrayAccess`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
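[Editor's sketch] The cache+clone approach mentioned above can be sketched like this (a hypothetical standalone version with illustrative names; the actual UUID code differs):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5CloneCacheSketch {
    // A cached template digest: clone() hands out a fresh copy without
    // repeating the provider lookup on every call.
    private static final MessageDigest MD5_TEMPLATE;
    static {
        try {
            MD5_TEMPLATE = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 support is required in every Java platform", e);
        }
    }

    static byte[] md5(byte[] input) {
        try {
            MessageDigest md = (MessageDigest) MD5_TEMPLATE.clone();
            return md.digest(input);
        } catch (CloneNotSupportedException e) {
            throw new AssertionError("the JDK's MD5 implementation is cloneable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5(new byte[0]).length); // an MD5 digest is 16 bytes
    }
}
```

Cloning sidesteps provider lookup but, as noted above, a hardcoded `new MD5()` would also sidestep the ability to install an alternative provider.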
On Wed, 6 Jan 2021 00:41:29 GMT, Claes Redestad <redestad@openjdk.org> wrote:
Hitting up `new MD5()` directly could be a great idea. I expect this would be just as fast as the cache+clone (if not faster), but I'm a bit worried we'd be short-circuiting the ability to install an alternative MD5 provider (which may or may not be a thing we must support..), but it's worth exploring.
Comparing performance of this against a `ByteBuffer` impl is on my TODO. The `VarHandle` gets heavily inlined and optimized here, though, with performance in my tests similar to the `Unsafe` use in `ByteArrayAccess`.
I've identified a number of optimizations to the plumbing behind `MessageDigest.getDigest(..)` over in #1933 that remove 80-90% of the throughput overhead and all the allocation overhead compared to the `clone()` approach prototyped here. The remaining 20ns/op overhead might not be enough of a concern to do a point fix in `UUID::nameUUIDFromBytes`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
On Wed, 6 Jan 2021 01:27:52 GMT, Claes Redestad <redestad@openjdk.org> wrote:
I've identified a number of optimizations to the plumbing behind `MessageDigest.getDigest(..)` over in #1933 that remove 80-90% of the throughput overhead and all the allocation overhead compared to the `clone()` approach prototyped here. The remaining 20ns/op overhead might not be enough of a concern to do a point fix in `UUID::nameUUIDFromBytes`.
Removing the UUID clone cache and running the microbenchmark along with the changes in #1933:

Benchmark                                     (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    2.182 ±  0.090  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate        20000  thrpt   12  439.020 ± 18.241  MB/sec
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  264.022 ±  0.003    B/op

The goal now is to simplify the digest code and compare alternatives.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
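[Editor's sketch] For context, the `UUIDBench.fromType3Bytes` micro exercises type-3 (name-based, MD5) UUID creation, which in plain code amounts to the following minimal usage example (not the benchmark itself):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class Type3UuidSketch {
    public static void main(String[] args) {
        // nameUUIDFromBytes computes an MD5 digest of the name and stamps
        // the version and variant bits into the result.
        byte[] name = "example".getBytes(StandardCharsets.UTF_8);
        UUID u = UUID.nameUUIDFromBytes(name);
        System.out.println(u.version()); // 3 = name-based, MD5
        System.out.println(u.variant()); // 2 = IETF variant
    }
}
```

Every call runs an MD5 digest internally, which is why digest allocation and `getInstance` plumbing dominate this micro.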
On Thu, 7 Jan 2021 14:45:03 GMT, Claes Redestad <redestad@openjdk.org> wrote:
Removing the UUID clone cache and running the microbenchmark along with the changes in #1933:
Benchmark                                     (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    2.182 ±  0.090  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate        20000  thrpt   12  439.020 ± 18.241  MB/sec
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  264.022 ±  0.003    B/op

The goal now is to simplify the digest code and compare alternatives.
I've run various tests and concluded that the `VarHandle`ized code is matching or improving upon the `Unsafe`-riddled code in `ByteArrayAccess`. I then went ahead and consolidated on a similar code pattern in `ByteArrayAccess` for consistency, which amounts to a good cleanup.

With MD5 intrinsics disabled, I get this baseline:

Benchmark                                     (size)   Mode  Cnt    Score   Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    1.245 ± 0.077  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  488.042 ± 0.004    B/op

With the current patch here (not including #1933):

Benchmark                                     (size)   Mode  Cnt    Score   Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    1.431 ± 0.106  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  408.035 ± 0.006    B/op

If I isolate the `ByteArrayAccess` changes I'm getting performance neutral or slightly better numbers compared to baseline for these tests:

Benchmark                                     (size)   Mode  Cnt    Score   Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    1.317 ± 0.092  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  488.042 ± 0.004    B/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
On Thu, 7 Jan 2021 18:50:05 GMT, Claes Redestad <redestad@openjdk.org> wrote:
I've run various tests and concluded that the `VarHandle`ized code is matching or improving upon the `Unsafe`-riddled code in `ByteArrayAccess`. I then went ahead and consolidated on a similar code pattern in `ByteArrayAccess` for consistency, which amounts to a good cleanup.
With MD5 intrinsics disabled, I get this baseline:
Benchmark                                     (size)   Mode  Cnt    Score   Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    1.245 ± 0.077  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  488.042 ± 0.004    B/op
With the current patch here (not including #1933):

Benchmark                                     (size)   Mode  Cnt    Score   Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    1.431 ± 0.106  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  408.035 ± 0.006    B/op
If I isolate the `ByteArrayAccess` changes I'm getting performance neutral or slightly better numbers compared to baseline for these tests:
Benchmark                                     (size)   Mode  Cnt    Score   Error   Units
UUIDBench.fromType3Bytes                       20000  thrpt   12    1.317 ± 0.092  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm   20000  thrpt   12  488.042 ± 0.004    B/op
Thanks for the performance enhancement, I will take a look.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad <redestad@openjdk.org> wrote:
src/java.base/share/classes/sun/security/provider/ByteArrayAccess.java line 214:

Why do we remove the index checking from all methods? Isn't it safer to check here in case the caller didn't? Or is such checking already implemented inside the various methods of VarHandle?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
On Fri, 15 Jan 2021 22:54:32 GMT, Valerie Peng <valeriep@openjdk.org> wrote:
src/java.base/share/classes/sun/security/provider/ByteArrayAccess.java line 214:
Why do we remove the index checking from all methods? Isn't it safer to check here in case the caller didn't? Or is such checking already implemented inside the various methods of VarHandle?
Yes, IOOBE checking is done by the VarHandle methods, whereas the Unsafe API is unsafe and needs careful precondition checking. It doesn't seem to matter for performance (interpreted code sees some benefit from the removal). With the current usage an IOOBE is probably not observable, but there's a test that reflects into ByteArrayAccess and verifies exceptions are thrown as expected on faulty inputs. ------------- PR: https://git.openjdk.java.net/jdk/pull/1855
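The VarHandle-based accessor pattern being discussed can be sketched like this (hypothetical class and method names, not the actual ByteArrayAccess source):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Little-endian int view over a byte[]. The VarHandle performs its own
// bounds checks and throws IndexOutOfBoundsException on bad offsets, so
// no explicit precondition checks are needed, unlike with Unsafe.
final class ByteArrayLE {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int b2iLittle(byte[] b, int off) {
        return (int) INT_LE.get(b, off);
    }

    static void i2bLittle(int v, byte[] b, int off) {
        INT_LE.set(b, off, v);
    }
}
```

An out-of-range offset, e.g. `b2iLittle(new byte[4], 1)`, raises `IndexOutOfBoundsException` from inside the VarHandle, which is the behavior the reflection-driven test verifies.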
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad <redestad@openjdk.org> wrote:
test/micro/org/openjdk/bench/java/util/UUIDBench.java line 2:
1: /* 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved.
nit: other files should also have this 2021 update. It seems most of them are not updated and still use 2020. ------------- PR: https://git.openjdk.java.net/jdk/pull/1855
On Fri, 15 Jan 2021 23:21:00 GMT, Valerie Peng <valeriep@openjdk.org> wrote:
Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision:
- Copyrights
- Merge branch 'master' into improve_md5
- Remove unused Unsafe import
- Harmonize MD4 impl, remove now-redundant checks from ByteArrayAccess (VHs do bounds checks, most of which will be optimized away)
- Merge branch 'master' into improve_md5
- Apply allocation avoiding optimizations to all SHA versions sharing structural similarities with MD5
- Remove unused reverseBytes imports
- Copyrights
- Fix copy-paste error
- Various fixes (IDE stopped IDEing..)
- ... and 10 more: https://git.openjdk.java.net/jdk/compare/6e03c8d3...cafa3e49
test/micro/org/openjdk/bench/java/util/UUIDBench.java line 2:
1: /* 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved.
nit: other files should also have this 2021 update. It seems most of them are not updated and still use 2020.
fixed ------------- PR: https://git.openjdk.java.net/jdk/pull/1855
Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision:
- Copyrights
- Merge branch 'master' into improve_md5
- Remove unused Unsafe import
- Harmonize MD4 impl, remove now-redundant checks from ByteArrayAccess (VHs do bounds checks, most of which will be optimized away)
- Merge branch 'master' into improve_md5
- Apply allocation avoiding optimizations to all SHA versions sharing structural similarities with MD5
- Remove unused reverseBytes imports
- Copyrights
- Fix copy-paste error
- Various fixes (IDE stopped IDEing..)
- ... and 10 more: https://git.openjdk.java.net/jdk/compare/03e99844...cafa3e49
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/1855/files
- new: https://git.openjdk.java.net/jdk/pull/1855/files/e1c943c5..cafa3e49
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=01
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=00-01
Stats: 28760 lines in 1103 files changed: 16020 ins; 7214 del; 5526 mod
Patch: https://git.openjdk.java.net/jdk/pull/1855.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1855/head:pull/1855
PR: https://git.openjdk.java.net/jdk/pull/1855
On Fri, 15 Jan 2021 23:36:35 GMT, Claes Redestad <redestad@openjdk.org> wrote:
Changes look good. Thanks. ------------- Marked as reviewed by valeriep (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/1855
Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision:
- Adjust to keep reflection-driven tests from failing
- Merge branch 'master' into improve_md5
- Copyrights
- Merge branch 'master' into improve_md5
- Remove unused Unsafe import
- Harmonize MD4 impl, remove now-redundant checks from ByteArrayAccess (VHs do bounds checks, most of which will be optimized away)
- Merge branch 'master' into improve_md5
- Apply allocation avoiding optimizations to all SHA versions sharing structural similarities with MD5
- Remove unused reverseBytes imports
- Copyrights
- ... and 12 more: https://git.openjdk.java.net/jdk/compare/25fa448d...fdd2d19e
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/1855/files
- new: https://git.openjdk.java.net/jdk/pull/1855/files/cafa3e49..fdd2d19e
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=02
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=01-02
Stats: 11783 lines in 75 files changed: 1309 ins; 9196 del; 1278 mod
Patch: https://git.openjdk.java.net/jdk/pull/1855.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1855/head:pull/1855
PR: https://git.openjdk.java.net/jdk/pull/1855
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:
- Remove unused code
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/1855/files
- new: https://git.openjdk.java.net/jdk/pull/1855/files/fdd2d19e..4c2798aa
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=03
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=1855&range=02-03
Stats: 16 lines in 1 file changed: 0 ins; 16 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/1855.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1855/head:pull/1855
PR: https://git.openjdk.java.net/jdk/pull/1855
On Mon, 18 Jan 2021 13:39:04 GMT, Claes Redestad <redestad@openjdk.org> wrote:
- The MD5 intrinsics added by [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) shows that the `int[] x` isn't actually needed. This also applies to the SHA intrinsics from which the MD5 intrinsic takes inspiration - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to make it acceptable to use inline and replace the array in MD5 wholesale. This improves performance both in the presence and the absence of the intrinsic optimization. - Doing the exact same thing in the SHA impls would be unwieldy (64+ element arrays), but allocating the array lazily gets most of the speed-up in the presence of an intrinsic while being neutral in its absence.
Baseline: (digesterName) (length) Cnt Score Error Units MessageDigests.digest MD5 16 15 2714.307 ± 21.133 ops/ms MessageDigests.digest MD5 1024 15 318.087 ± 0.637 ops/ms MessageDigests.digest SHA-1 16 15 1387.266 ± 40.932 ops/ms MessageDigests.digest SHA-1 1024 15 109.273 ± 0.149 ops/ms MessageDigests.digest SHA-256 16 15 995.566 ± 21.186 ops/ms MessageDigests.digest SHA-256 1024 15 89.104 ± 0.079 ops/ms MessageDigests.digest SHA-512 16 15 803.030 ± 15.722 ops/ms MessageDigests.digest SHA-512 1024 15 115.611 ± 0.234 ops/ms MessageDigests.getAndDigest MD5 16 15 2190.367 ± 97.037 ops/ms MessageDigests.getAndDigest MD5 1024 15 302.903 ± 1.809 ops/ms MessageDigests.getAndDigest SHA-1 16 15 1262.656 ± 43.751 ops/ms MessageDigests.getAndDigest SHA-1 1024 15 104.889 ± 3.554 ops/ms MessageDigests.getAndDigest SHA-256 16 15 914.541 ± 55.621 ops/ms MessageDigests.getAndDigest SHA-256 1024 15 85.708 ± 1.394 ops/ms MessageDigests.getAndDigest SHA-512 16 15 737.719 ± 53.671 ops/ms MessageDigests.getAndDigest SHA-512 1024 15 112.307 ± 1.950 ops/ms
GC: MessageDigests.getAndDigest:·gc.alloc.rate.norm MD5 16 15 312.011 ± 0.005 B/op MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-1 16 15 584.020 ± 0.006 B/op MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-256 16 15 544.019 ± 0.016 B/op MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-512 16 15 1056.037 ± 0.003 B/op
Target: Benchmark (digesterName) (length) Cnt Score Error Units MessageDigests.digest MD5 16 15 3134.462 ± 43.685 ops/ms MessageDigests.digest MD5 1024 15 323.667 ± 0.633 ops/ms MessageDigests.digest SHA-1 16 15 1418.742 ± 38.223 ops/ms MessageDigests.digest SHA-1 1024 15 110.178 ± 0.788 ops/ms MessageDigests.digest SHA-256 16 15 1037.949 ± 21.214 ops/ms MessageDigests.digest SHA-256 1024 15 89.671 ± 0.228 ops/ms MessageDigests.digest SHA-512 16 15 812.028 ± 39.489 ops/ms MessageDigests.digest SHA-512 1024 15 116.738 ± 0.249 ops/ms MessageDigests.getAndDigest MD5 16 15 2314.379 ± 229.294 ops/ms MessageDigests.getAndDigest MD5 1024 15 307.835 ± 5.730 ops/ms MessageDigests.getAndDigest SHA-1 16 15 1326.887 ± 63.263 ops/ms MessageDigests.getAndDigest SHA-1 1024 15 106.611 ± 2.292 ops/ms MessageDigests.getAndDigest SHA-256 16 15 961.589 ± 82.052 ops/ms MessageDigests.getAndDigest SHA-256 1024 15 88.646 ± 0.194 ops/ms MessageDigests.getAndDigest SHA-512 16 15 775.417 ± 56.775 ops/ms MessageDigests.getAndDigest SHA-512 1024 15 112.904 ± 2.014 ops/ms
GC MessageDigests.getAndDigest:·gc.alloc.rate.norm MD5 16 15 232.009 ± 0.006 B/op MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-1 16 15 584.021 ± 0.001 B/op MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-256 16 15 272.012 ± 0.015 B/op MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-512 16 15 400.017 ± 0.019 B/op
For the `digest` micro digesting small inputs is faster with all algorithms, ranging from ~1% for SHA-512 up to ~15% for MD5. The gain stems from not allocating and reading into a temporary buffer once outside of the intrinsic. SHA-1 does not see a statistically gain because the intrinsic is disabled by default on my HW.
For the `getAndDigest` micro - which tests `MessageDigest.getInstance(..).digest(..)` - there are similar gains with this patch. The interesting aspect here is verifying the reduction in allocations per operation when there's an active intrinsic (again, not for SHA-1). JDK-8259065 (#1933) reduced allocations on each of these by 144 B/op, which means allocation pressure for SHA-512 is down by two thirds in this contrived test, from 1200 B/op to 400 B/op.
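The shape of that micro is essentially the following (an illustrative sketch, not the actual JMH source; the class and method names are hypothetical): each operation looks up a fresh `MessageDigest` and digests a small input, so the per-op allocation figures above include the digest instance itself, not just the temporary buffers.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Rough shape of the getAndDigest operation: instantiate and digest in one
// go, so gc.alloc.rate.norm captures both the instance and any scratch
// arrays the implementation allocates per call.
public class GetAndDigestSketch {
    static byte[] getAndDigest(String algorithm, byte[] input)
            throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).digest(input);
    }

    public static void main(String[] args) throws Exception {
        byte[] input = new byte[16];            // the 16-byte case from the tables
        byte[] md5 = getAndDigest("MD5", input);
        System.out.println(md5.length);         // MD5 yields a 16-byte digest
    }
}
```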
I've verified there are no regressions in the absence of the intrinsic, which the SHA-1 numbers here help show.
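The reason the non-intrinsic path stays neutral is the lazy-allocation approach taken in the SHA implementations. A simplified sketch (field and method names here are illustrative, not the actual `sun.security.provider` code): the temporary message-schedule array is only allocated if the pure-Java compression body actually runs, so when the intrinsic replaces that body the allocation never happens.

```java
// Illustrative sketch of lazily allocating the message-schedule array.
// In the JDK, implCompress0 is an intrinsic candidate; when the intrinsic
// fires, the Java body below never executes and the array stays null.
public class LazyScheduleSketch {
    private int[] w;  // message schedule; null until the Java path needs it

    void implCompress0(byte[] buf, int ofs) {
        if (w == null) {
            w = new int[64];  // allocated lazily, at most once per instance
        }
        // ... message expansion and the compression rounds would use w here ...
    }

    boolean scheduleAllocated() {
        return w != null;
    }
}
```

Inlining the values directly (as done for MD5's 16 words) would be unwieldy for the 64+ element SHA schedules, which is why the lazy array is the pragmatic middle ground.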
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:
Remove unused code
Marked as reviewed by valeriep (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad <redestad@openjdk.org> wrote:
This pull request has now been integrated.

Changeset: 35c9da70
Author: Claes Redestad <redestad@openjdk.org>
URL: https://git.openjdk.java.net/jdk/commit/35c9da70
Stats: 655 lines in 8 files changed: 79 ins; 350 del; 226 mod

8259498: Reduce overhead of MD5 and SHA digests

Reviewed-by: valeriep

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
participants (3)
- Claes Redestad
- DellCliff
- Valerie Peng