RFR: 8302113: Improve CRC32 intrinsic with crypto pmull on AArch64 [v3]
Volker Simonis
simonis at openjdk.org
Thu Feb 16 12:37:30 UTC 2023
On Thu, 16 Feb 2023 06:13:03 GMT, Yi-Fan Tsai <duke at openjdk.org> wrote:
>> The pmull and pmull2 instructions support 64-bit operands when the Cryptographic Extension is available. The execution throughput of this form rises from 1 on Neoverse N1 to 4 on Neoverse V1, while the latency remains 2. The CRC32 instructions have not changed: latency 2, throughput 1. As a result, computing CRC32 with pmull can perform better than using the crc32 instruction.
>>
>> The following test has passed.
>> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
>>
>> The throughput reported by [the micro benchmark](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/util/TestCRC32.java) was measured on an EC2 c7g instance. The optimization shows an improvement of 11% to 99% when the input is at least 384 bytes.
>>
>> | input | 64 | 128 | 256 | 384 | 511 | 512 | 1,024 |
>> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
>> | improvement | 0.02% | 0.02% | 0.00% | 16.00% | 11.94% | 34.75% | 69.80% |
>>
>> | input | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 |
>> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
>> | improvement | 77.61% | 92.33% | 95.98% | 97.95% | 99.33% | 98.36% |
>>
>>
>> Baseline
>>
>> TestCRC32.testCRC32Update 64 thrpt 12 173126.358 ± 118.330 ops/ms
>> TestCRC32.testCRC32Update 128 thrpt 12 112910.118 ± 47.305 ops/ms
>> TestCRC32.testCRC32Update 256 thrpt 12 66601.990 ± 7.294 ops/ms
>> TestCRC32.testCRC32Update 384 thrpt 12 47229.319 ± 3.949 ops/ms
>> TestCRC32.testCRC32Update 511 thrpt 12 33733.119 ± 4.076 ops/ms
>> TestCRC32.testCRC32Update 512 thrpt 12 36584.565 ± 4.211 ops/ms
>> TestCRC32.testCRC32Update 1024 thrpt 12 19239.083 ± 1.040 ops/ms
>> TestCRC32.testCRC32Update 2048 thrpt 12 9875.652 ± 0.435 ops/ms
>> TestCRC32.testCRC32Update 4096 thrpt 12 5004.425 ± 0.290 ops/ms
>> TestCRC32.testCRC32Update 8192 thrpt 12 2519.185 ± 0.169 ops/ms
>> TestCRC32.testCRC32Update 16384 thrpt 12 1263.909 ± 0.194 ops/ms
>> TestCRC32.testCRC32Update 32768 thrpt 12 632.018 ± 0.053 ops/ms
>> TestCRC32.testCRC32Update 65536 thrpt 12 315.471 ± 0.095 ops/ms
>>
>>
>> Crypto pmull
>>
>> TestCRC32.testCRC32Update 64 thrpt 12 173168.669 ± 4.746 ops/ms
>> TestCRC32.testCRC32Update 128 thrpt 12 112933.519 ± 4.583 ops/ms
>> TestCRC32.testCRC32Update 256 thrpt 12 66602.462 ± 3.150 ops/ms
>> TestCRC32.testCRC32Update 384 thrpt 12 54784.739 ± 2.110 ops/ms
>> TestCRC32.testCRC32Update 511 thrpt 12 37760.816 ± 69.911 ops/ms
>> TestCRC32.testCRC32Update 512 thrpt 12 49297.609 ± 21.983 ops/ms
>> TestCRC32.testCRC32Update 1024 thrpt 12 32667.507 ± 90.610 ops/ms
>> TestCRC32.testCRC32Update 2048 thrpt 12 17539.986 ± 511.416 ops/ms
>> TestCRC32.testCRC32Update 4096 thrpt 12 9625.249 ± 9.713 ops/ms
>> TestCRC32.testCRC32Update 8192 thrpt 12 4937.135 ± 6.121 ops/ms
>> TestCRC32.testCRC32Update 16384 thrpt 12 2501.936 ± 1.270 ops/ms
>> TestCRC32.testCRC32Update 32768 thrpt 12 1259.831 ± 0.119 ops/ms
>> TestCRC32.testCRC32Update 65536 thrpt 12 625.773 ± 0.242 ops/ms
>
> Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision:
>
> Make UseCryptoPmullForCRC32 independent of UseCRC32
Looks good now. Thanks for the explanation and for decoupling `UseCryptoPmullForCRC32` and `UseCRC32`.
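
For readers checking the numbers: the improvement percentages in the quoted table appear to be the relative throughput gain, (pmull / baseline - 1) * 100, computed from the two JMH result lists. The following is only a minimal sketch under that assumption; the class and method names are hypothetical and are not part of the PR. The throughput values are copied from the quoted results.

```java
public class Crc32Improvement {
    // Relative throughput gain in percent, assuming the table uses
    // (optimized / baseline - 1) * 100. Hypothetical helper, not from the PR.
    static double improvementPercent(double baseline, double optimized) {
        return (optimized / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Input sizes in bytes and throughputs in ops/ms, taken from the quoted results.
        int[] sizes = {384, 512, 1024, 65536};
        double[] baseline = {47229.319, 36584.565, 19239.083, 315.471};
        double[] pmull = {54784.739, 49297.609, 32667.507, 625.773};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%6d bytes: %.2f%%%n",
                    sizes[i], improvementPercent(baseline[i], pmull[i]));
        }
    }
}
```

Under this reading, a gain close to 100% at the largest sizes corresponds to roughly twice the baseline throughput.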
-------------
Marked as reviewed by simonis (Reviewer).
PR: https://git.openjdk.org/jdk/pull/12480