RFR: 8302113: Improve CRC32 intrinsic with crypto pmull on AArch64
Andrew Haley
aph at openjdk.org
Sun Feb 12 14:54:31 UTC 2023
On Thu, 9 Feb 2023 02:25:27 GMT, Yi-Fan Tsai <duke at openjdk.org> wrote:
> The pmull and pmull2 instructions support operating on 64-bit data when the Cryptographic Extension is present. The execution throughput of this form rises from 1 on Neoverse N1 to 4 on Neoverse V1, while the latency remains 2. The CRC32 instructions did not change: latency 2, throughput 1. As a result, computing CRC32 using pmull could perform better than using the crc32 instruction.
>
> The following test has passed.
> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
>
> The throughput reported by the microbenchmark was measured on an EC2 c7g instance. The optimization shows an 11% to 99% improvement when the input is at least 384 bytes.
>
> | input | 64 | 128 | 256 | 384 | 511 | 512 | 1,024 |
> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
> | improvement | 0.02% | 0.02% | 0.00% | 16.00% | 11.94% | 34.75% | 69.80% |
>
> | input | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 |
> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
> | improvement | 77.61% | 92.33% | 95.98% | 97.95% | 99.33% | 98.36% |
>
>
> Baseline
>
> TestCRC32.testCRC32Update 64 thrpt 12 173126.358 ± 118.330 ops/ms
> TestCRC32.testCRC32Update 128 thrpt 12 112910.118 ± 47.305 ops/ms
> TestCRC32.testCRC32Update 256 thrpt 12 66601.990 ± 7.294 ops/ms
> TestCRC32.testCRC32Update 384 thrpt 12 47229.319 ± 3.949 ops/ms
> TestCRC32.testCRC32Update 511 thrpt 12 33733.119 ± 4.076 ops/ms
> TestCRC32.testCRC32Update 512 thrpt 12 36584.565 ± 4.211 ops/ms
> TestCRC32.testCRC32Update 1024 thrpt 12 19239.083 ± 1.040 ops/ms
> TestCRC32.testCRC32Update 2048 thrpt 12 9875.652 ± 0.435 ops/ms
> TestCRC32.testCRC32Update 4096 thrpt 12 5004.425 ± 0.290 ops/ms
> TestCRC32.testCRC32Update 8192 thrpt 12 2519.185 ± 0.169 ops/ms
> TestCRC32.testCRC32Update 16384 thrpt 12 1263.909 ± 0.194 ops/ms
> TestCRC32.testCRC32Update 32768 thrpt 12 632.018 ± 0.053 ops/ms
> TestCRC32.testCRC32Update 65536 thrpt 12 315.471 ± 0.095 ops/ms
>
>
> Crypto pmull
>
> TestCRC32.testCRC32Update 64 thrpt 12 173168.669 ± 4.746 ops/ms
> TestCRC32.testCRC32Update 128 thrpt 12 112933.519 ± 4.583 ops/ms
> TestCRC32.testCRC32Update 256 thrpt 12 66602.462 ± 3.150 ops/ms
> TestCRC32.testCRC32Update 384 thrpt 12 54784.739 ± 2.110 ops/ms
> TestCRC32.testCRC32Update 511 thrpt 12 37760.816 ± 69.911 ops/ms
> TestCRC32.testCRC32Update 512 thrpt 12 49297.609 ± 21.983 ops/ms
> TestCRC32.testCRC32Update 1024 thrpt 12 32667.507 ± 90.610 ops/ms
> TestCRC32.testCRC32Update 2048 thrpt 12 17539.986 ± 511.416 ops/ms
> TestCRC32.testCRC32Update 4096 thrpt 12 9625.249 ± 9.713 ops/ms
> TestCRC32.testCRC32Update 8192 thrpt 12 4937.135 ± 6.121 ops/ms
> TestCRC32.testCRC32Update 16384 thrpt 12 2501.936 ± 1.270 ops/ms
> TestCRC32.testCRC32Update 32768 thrpt 12 1259.831 ± 0.119 ops/ms
> TestCRC32.testCRC32Update 65536 thrpt 12 625.773 ± 0.242 ops/ms
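The improvement percentages in the tables above can be re-derived from the two result sets as (optimized / baseline - 1) * 100. The following is a minimal sketch of that arithmetic for three of the rows; the class and method names are made up for illustration and are not part of the PR or of the benchmark.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: recomputes the reported
// improvement from the quoted JMH throughput figures (ops/ms).
class Crc32ImprovementCheck {

    // Relative throughput gain of the optimized run over the baseline, in percent.
    static double improvementPercent(double baselineOpsPerMs, double optimizedOpsPerMs) {
        return (optimizedOpsPerMs / baselineOpsPerMs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Rows copied from the "Baseline" and "Crypto pmull" listings above.
        int[] sizes = {384, 512, 65536};
        double[] baseline = {47229.319, 36584.565, 315.471};
        double[] pmull = {54784.739, 49297.609, 625.773};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf(Locale.ROOT, "%6d bytes: %.2f%%%n",
                    sizes[i], improvementPercent(baseline[i], pmull[i]));
        }
    }
}
```

For these three rows the computed values agree with the table entries (16.00%, 34.75%, 98.36%) to two decimal places.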
The CRC instructions, optional in ARMv8.0, become a requirement in ARMv8.1. ARMv8.0 is stuff like Cortex A53 but also, apparently, Cortex A72. So I guess we're stuck with all three, at least for now. This is a bad business, but I guess it's something that Arm & partners have dropped on us, and there's little we can do about it.
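To make concrete what keeping several code paths costs, here is a rough sketch of run-time selection by CPU feature. Everything in it is an assumption for illustration: the enum names, the choice of a table-lookup fallback, and the 384-byte cutoff (borrowed from the benchmark figures quoted above) are hypothetical and do not describe HotSpot's actual stub generation.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not HotSpot code: picks a CRC32 strategy from the
// features a CPU advertises.
class Crc32PathSketch {

    enum CpuFeature { CRC32, PMULL }

    enum Path { TABLE_LOOKUP, CRC32_INSTRUCTION, CRYPTO_PMULL }

    // Assumed cutoff, taken from where the quoted benchmark starts to show a gain.
    static final int PMULL_MIN_BYTES = 384;

    static Path select(Set<CpuFeature> features, int lengthInBytes) {
        if (features.contains(CpuFeature.PMULL) && lengthInBytes >= PMULL_MIN_BYTES) {
            return Path.CRYPTO_PMULL;
        }
        if (features.contains(CpuFeature.CRC32)) {
            return Path.CRC32_INSTRUCTION;
        }
        // ARMv8.0 parts without the optional CRC extension end up here.
        return Path.TABLE_LOOKUP;
    }

    public static void main(String[] args) {
        System.out.println(select(EnumSet.noneOf(CpuFeature.class), 4096));
        System.out.println(select(EnumSet.of(CpuFeature.CRC32), 4096));
        System.out.println(select(EnumSet.allOf(CpuFeature.class), 4096));
        System.out.println(select(EnumSet.allOf(CpuFeature.class), 64));
    }
}
```

Each extra branch is another path to test and maintain, which is the burden being complained about here.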
-------------
PR: https://git.openjdk.org/jdk/pull/12480