RFR: 8302113: Improve CRC32 intrinsic with crypto pmull on AArch64
Yi-Fan Tsai
duke at openjdk.org
Sun Feb 12 02:06:25 UTC 2023
On Sat, 11 Feb 2023 23:30:56 GMT, Andrew Haley <aph at openjdk.org> wrote:
> How much more efficient is it than the old version of CRC32 using pmull?
The old version of CRC32 using pmull is inefficient compared to the version using crc32 instructions, measured on the same EC2 c7g instance.
The following tables show the performance of each configuration relative to the version using crc32 instructions.
| input | 64 | 128 | 256 | 384 | 511 | 512 | 1,024 |
| ---------------------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| -XX:-UseCRC32 -XX:+UseNeon (pmull) | -88.10% | -88.31% | -88.57% | -88.64% | -88.71% | -88.73% | -88.85% |
| -XX:-UseCRC32 -XX:-UseNeon | -90.45% | -92.53% | -93.62% | -93.98% | -93.73% | -94.17% | -94.51% |
| input | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 |
| -----------------------------------| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| -XX:-UseCRC32 -XX:+UseNeon (pmull) | -88.92% | -88.96% | -88.97% | -88.98% | -88.96% | -88.97% |
| -XX:-UseCRC32 -XX:-UseNeon | -94.58% | -94.65% | -94.68% | -94.70% | -94.70% | -94.70% |
`-XX:-UseCRC32 -XX:+UseNeon` (pmull)
TestCRC32.testCRC32Update 64 thrpt 12 20593.817 ± 8.676 ops/ms
TestCRC32.testCRC32Update 128 thrpt 12 13197.522 ± 4.268 ops/ms
TestCRC32.testCRC32Update 256 thrpt 12 7610.119 ± 6.329 ops/ms
TestCRC32.testCRC32Update 384 thrpt 12 5365.177 ± 0.570 ops/ms
TestCRC32.testCRC32Update 511 thrpt 12 3808.462 ± 3.441 ops/ms
TestCRC32.testCRC32Update 512 thrpt 12 4121.495 ± 4.549 ops/ms
TestCRC32.testCRC32Update 1024 thrpt 12 2144.889 ± 1.610 ops/ms
TestCRC32.testCRC32Update 2048 thrpt 12 1093.756 ± 0.114 ops/ms
TestCRC32.testCRC32Update 4096 thrpt 12 552.393 ± 0.164 ops/ms
TestCRC32.testCRC32Update 8192 thrpt 12 277.963 ± 0.099 ops/ms
TestCRC32.testCRC32Update 16384 thrpt 12 139.233 ± 0.061 ops/ms
TestCRC32.testCRC32Update 32768 thrpt 12 69.752 ± 0.013 ops/ms
TestCRC32.testCRC32Update 65536 thrpt 12 34.789 ± 0.002 ops/ms
`-XX:-UseCRC32 -XX:-UseNeon`
TestCRC32.testCRC32Update 64 thrpt 12 16541.230 ± 50.847 ops/ms
TestCRC32.testCRC32Update 128 thrpt 12 8432.515 ± 6.676 ops/ms
TestCRC32.testCRC32Update 256 thrpt 12 4252.450 ± 2.897 ops/ms
TestCRC32.testCRC32Update 384 thrpt 12 2844.691 ± 0.110 ops/ms
TestCRC32.testCRC32Update 511 thrpt 12 2114.317 ± 2.670 ops/ms
TestCRC32.testCRC32Update 512 thrpt 12 2134.668 ± 0.150 ops/ms
TestCRC32.testCRC32Update 1024 thrpt 12 1055.619 ± 27.552 ops/ms
TestCRC32.testCRC32Update 2048 thrpt 12 535.514 ± 0.273 ops/ms
TestCRC32.testCRC32Update 4096 thrpt 12 267.859 ± 0.133 ops/ms
TestCRC32.testCRC32Update 8192 thrpt 12 133.916 ± 0.014 ops/ms
TestCRC32.testCRC32Update 16384 thrpt 12 67.003 ± 0.034 ops/ms
TestCRC32.testCRC32Update 32768 thrpt 12 33.471 ± 0.023 ops/ms
TestCRC32.testCRC32Update 65536 thrpt 12 16.724 ± 0.018 ops/ms
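The percentages in the tables are relative to the crc32-instruction version, whose raw scores are not listed in this message. Assuming they are the usual relative change in throughput, `(candidate - baseline) / baseline`, the same formula can be applied to the two configurations whose scores are listed. The class and method names below are hypothetical and only illustrate the arithmetic.

```java
public class RelativeThroughput {
    // Percent change of a candidate throughput relative to a baseline throughput.
    // Assumption: this is how the percentages in the tables were derived.
    static double percentChange(double candidate, double baseline) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms for the 64-byte input, copied from the JMH output above.
        double pmull = 20593.817;   // -XX:-UseCRC32 -XX:+UseNeon
        double noNeon = 16541.230;  // -XX:-UseCRC32 -XX:-UseNeon
        System.out.printf("pmull vs. no-NEON at 64 bytes: %+.2f%%%n",
                percentChange(pmull, noNeon));
    }
}
```

By this measure the old pmull version is ahead of the non-NEON fallback at 64 bytes, even though both are far behind the crc32-instruction version.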
> And do we today need three different hand-coded versions of CRC32?
Each version depends on different processor features. Removing the less efficient versions could degrade existing use cases on processors that lack the features the faster versions need.
It might still be reasonable to remove them, though. The crc32c intrinsic has only one version, which uses the crc32c instruction.
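For context, the two intrinsics discussed here sit behind `java.util.zip.CRC32` and `java.util.zip.CRC32C` (the latter requires Java 9 or later). The following is a minimal sketch of the Java-level calls, not the benchmark itself. The class name `ChecksumDemo` and the helper `checksum` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class ChecksumDemo {
    // Feeds the whole array through the given Checksum and returns its value.
    static long checksum(Checksum c, byte[] data) {
        c.reset();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the conventional check input for CRC algorithms.
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32  = %08x%n", checksum(new CRC32(), data));
        System.out.printf("CRC32C = %08x%n", checksum(new CRC32C(), data));
    }
}
```

Whichever hand-coded version the JVM selects, the result of these calls must be the same. Only the throughput differs.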
-------------
PR: https://git.openjdk.org/jdk/pull/12480