RFR: 8302113: Improve CRC32 intrinsic with crypto pmull on AArch64
Yi-Fan Tsai
duke at openjdk.org
Fri Feb 10 21:20:44 UTC 2023
The pmull and pmull2 instructions support operating on 64-bit data when the Cryptographic Extension is present. The execution throughput of this form rises from 1 on Neoverse N1 to 4 on Neoverse V1, while the latency remains 2. The CRC32 instructions did not change: latency 2, throughput 1. As a result, computing CRC32 using pmull can perform better than using the crc32 instructions.
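For readers unfamiliar with the operation, the following is a minimal software sketch of what a 64-bit pmull computes: a carry-less (polynomial over GF(2)) multiply of two 64-bit operands into a 128-bit product. It is only an illustration, not the intrinsic code in this patch; the class name ClmulSketch and the method clmul64 are hypothetical.

```java
// Hypothetical illustration only; this is not code from the patch.
public class ClmulSketch {
    /**
     * Carry-less multiply of two 64-bit values, treating each as a
     * polynomial over GF(2). Returns {high 64 bits, low 64 bits} of the
     * 128-bit product, the same result a 64-bit pmull produces in hardware.
     */
    static long[] clmul64(long a, long b) {
        long hi = 0;
        long lo = 0;
        for (int i = 0; i < 64; i++) {
            if (((b >>> i) & 1L) != 0) {
                // XOR in a shifted copy of a; XOR replaces addition, so no carries.
                lo ^= a << i;
                // Java masks shift counts to 6 bits, so a >>> 64 would be a no-op;
                // skip i == 0, where nothing spills into the high half.
                if (i != 0) {
                    hi ^= a >>> (64 - i);
                }
            }
        }
        return new long[] { hi, lo };
    }

    public static void main(String[] args) {
        // (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 0b11 * 0b11 = 0b101.
        long[] p = clmul64(0b11L, 0b11L);
        System.out.println(Long.toBinaryString(p[1])); // prints 101
    }
}
```

The software loop needs 64 iterations per product, which is why a single instruction with a throughput of 4 per cycle makes folding-based CRC attractive on larger inputs.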
The following test has passed.
test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
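As a rough sketch of the kind of check such a test can perform (this is not the jtreg test itself), the snippet below compares java.util.zip.CRC32 against a simple bitwise reference for several of the input lengths used in the benchmark. The class name Crc32CheckSketch, the helper names, and the random seed are assumptions made for illustration.

```java
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical illustration only; this is not test/hotspot/jtreg/.../TestCRC32.java.
public class Crc32CheckSketch {
    /** Bitwise reference CRC-32 using the reflected polynomial 0xEDB88320. */
    static int referenceCrc32(byte[] data, int off, int len) {
        int crc = 0xFFFFFFFF;
        for (int i = off; i < off + len; i++) {
            crc ^= data[i] & 0xFF;
            for (int k = 0; k < 8; k++) {
                // -(crc & 1) is all ones when the low bit is set, else zero.
                crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
            }
        }
        return ~crc;
    }

    /** CRC-32 via the JDK class, which HotSpot may replace with an intrinsic. */
    static long libraryCrc32(byte[] data, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] buf = new byte[65536];
        new Random(42).nextBytes(buf); // fixed seed, chosen arbitrarily
        int[] sizes = { 64, 128, 256, 384, 511, 512, 1024, 65536 };
        for (int n : sizes) {
            long expected = referenceCrc32(buf, 0, n) & 0xFFFFFFFFL;
            long actual = libraryCrc32(buf, 0, n);
            if (expected != actual) {
                throw new AssertionError("CRC32 mismatch at length " + n);
            }
        }
        System.out.println("all lengths match");
    }
}
```

Lengths on both sides of the 384-byte threshold, including the odd length 511, are included so that a tail-handling bug in a new code path would show up as a mismatch.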
The throughput reported by the microbenchmark was measured on an EC2 c7g instance. The optimization shows an 11 - 99% improvement when the input is at least 384 bytes.
| input (bytes) | 64 | 128 | 256 | 384 | 511 | 512 | 1,024 |
| ------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| CRC32 improvement | 0.02% | 0.02% | 0.00% | 16.00% | 11.94% | 34.75% | 69.80% |
| input (bytes) | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 |
| ------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| CRC32 improvement | 77.61% | 92.33% | 95.98% | 97.95% | 99.33% | 98.36% |
Baseline
TestCRC32.testCRC32Update 64 thrpt 12 173126.358 ± 118.330 ops/ms
TestCRC32.testCRC32Update 128 thrpt 12 112910.118 ± 47.305 ops/ms
TestCRC32.testCRC32Update 256 thrpt 12 66601.990 ± 7.294 ops/ms
TestCRC32.testCRC32Update 384 thrpt 12 47229.319 ± 3.949 ops/ms
TestCRC32.testCRC32Update 511 thrpt 12 33733.119 ± 4.076 ops/ms
TestCRC32.testCRC32Update 512 thrpt 12 36584.565 ± 4.211 ops/ms
TestCRC32.testCRC32Update 1024 thrpt 12 19239.083 ± 1.040 ops/ms
TestCRC32.testCRC32Update 2048 thrpt 12 9875.652 ± 0.435 ops/ms
TestCRC32.testCRC32Update 4096 thrpt 12 5004.425 ± 0.290 ops/ms
TestCRC32.testCRC32Update 8192 thrpt 12 2519.185 ± 0.169 ops/ms
TestCRC32.testCRC32Update 16384 thrpt 12 1263.909 ± 0.194 ops/ms
TestCRC32.testCRC32Update 32768 thrpt 12 632.018 ± 0.053 ops/ms
TestCRC32.testCRC32Update 65536 thrpt 12 315.471 ± 0.095 ops/ms
Crypto pmull
TestCRC32.testCRC32Update 64 thrpt 12 173168.669 ± 4.746 ops/ms
TestCRC32.testCRC32Update 128 thrpt 12 112933.519 ± 4.583 ops/ms
TestCRC32.testCRC32Update 256 thrpt 12 66602.462 ± 3.150 ops/ms
TestCRC32.testCRC32Update 384 thrpt 12 54784.739 ± 2.110 ops/ms
TestCRC32.testCRC32Update 511 thrpt 12 37760.816 ± 69.911 ops/ms
TestCRC32.testCRC32Update 512 thrpt 12 49297.609 ± 21.983 ops/ms
TestCRC32.testCRC32Update 1024 thrpt 12 32667.507 ± 90.610 ops/ms
TestCRC32.testCRC32Update 2048 thrpt 12 17539.986 ± 511.416 ops/ms
TestCRC32.testCRC32Update 4096 thrpt 12 9625.249 ± 9.713 ops/ms
TestCRC32.testCRC32Update 8192 thrpt 12 4937.135 ± 6.121 ops/ms
TestCRC32.testCRC32Update 16384 thrpt 12 2501.936 ± 1.270 ops/ms
TestCRC32.testCRC32Update 32768 thrpt 12 1259.831 ± 0.119 ops/ms
TestCRC32.testCRC32Update 65536 thrpt 12 625.773 ± 0.242 ops/ms
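The improvement percentages in the table appear to be the ratio of the two throughput scores minus one. A small sketch of that arithmetic, using three rows from the results above (the class name ImprovementSketch and the method improvementPercent are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper; the formula is inferred from the reported numbers.
public class ImprovementSketch {
    /** Relative throughput gain of the pmull run over the baseline, in percent. */
    static double improvementPercent(double baselineOpsPerMs, double pmullOpsPerMs) {
        return (pmullOpsPerMs / baselineOpsPerMs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms copied from the Baseline and Crypto pmull runs.
        double[][] rows = {
            { 384, 47229.319, 54784.739 },
            { 512, 36584.565, 49297.609 },
            { 65536, 315.471, 625.773 },
        };
        for (double[] r : rows) {
            System.out.printf(Locale.ROOT, "%d bytes: %.2f%%%n",
                    (int) r[0], improvementPercent(r[1], r[2]));
        }
    }
}
```

For the 384-byte row this gives 54784.739 / 47229.319 - 1, roughly 16.00%, which matches the table.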
-------------
Commit messages:
- Remove CRC32-C
- Support CRC32-C
- Merge master
- Add microbenchmark TestCRC32
- Change code alignment
- Separate code paths
- Enable on Neoverse V1
- Disable the optimization by default
- Reduce data dependency before load
- PMULL
Changes: https://git.openjdk.org/jdk/pull/12480/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12480&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8302113
Stats: 214 lines in 5 files changed: 213 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/12480.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/12480/head:pull/12480
PR: https://git.openjdk.org/jdk/pull/12480