RFR: 8302113: Improve CRC32 intrinsic with crypto pmull on AArch64

Yi-Fan Tsai duke at openjdk.org
Sun Feb 12 02:06:25 UTC 2023


On Sat, 11 Feb 2023 23:30:56 GMT, Andrew Haley <aph at openjdk.org> wrote:

> How much more efficient is it than the old version of CRC32 using pmull?

The old version of CRC32 using pmull is inefficient compared to the version using crc32 instructions, as measured on the same EC2 c7g instance.

The following is the throughput of each configuration relative to the version using crc32 instructions, by input size.
| input                              | 64         | 128        | 256        | 384        | 511        | 512        | 1,024      |
| ---------------------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| -XX:-UseCRC32 -XX:+UseNeon (pmull) | -88.10%    | -88.31%    | -88.57%    | -88.64%    | -88.71%    | -88.73%    | -88.85%    |
| -XX:-UseCRC32 -XX:-UseNeon         | -90.45%    | -92.53%    | -93.62%    | -93.98%    | -93.73%    | -94.17%    | -94.51%    |

| input                              | 2,048      | 4,096      | 8,192      | 16,384     | 32,768     | 65,536     |
| -----------------------------------| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| -XX:-UseCRC32 -XX:+UseNeon (pmull) | -88.92%    | -88.96%    | -88.97%    | -88.98%    | -88.96%    | -88.97%    |
| -XX:-UseCRC32 -XX:-UseNeon         | -94.58%    | -94.65%    | -94.68%    | -94.70%    | -94.70%    | -94.70%    |
  

`-XX:-UseCRC32 -XX:+UseNeon` (pmull)

TestCRC32.testCRC32Update         64  thrpt   12   20593.817 ±   8.676  ops/ms
TestCRC32.testCRC32Update        128  thrpt   12   13197.522 ±   4.268  ops/ms
TestCRC32.testCRC32Update        256  thrpt   12    7610.119 ±   6.329  ops/ms
TestCRC32.testCRC32Update        384  thrpt   12    5365.177 ±   0.570  ops/ms
TestCRC32.testCRC32Update        511  thrpt   12    3808.462 ±   3.441  ops/ms
TestCRC32.testCRC32Update        512  thrpt   12    4121.495 ±   4.549  ops/ms
TestCRC32.testCRC32Update       1024  thrpt   12    2144.889 ±   1.610  ops/ms
TestCRC32.testCRC32Update       2048  thrpt   12    1093.756 ±   0.114  ops/ms
TestCRC32.testCRC32Update       4096  thrpt   12     552.393 ±   0.164  ops/ms
TestCRC32.testCRC32Update       8192  thrpt   12     277.963 ±   0.099  ops/ms
TestCRC32.testCRC32Update      16384  thrpt   12     139.233 ±   0.061  ops/ms
TestCRC32.testCRC32Update      32768  thrpt   12      69.752 ±   0.013  ops/ms
TestCRC32.testCRC32Update      65536  thrpt   12      34.789 ±   0.002  ops/ms


`-XX:-UseCRC32 -XX:-UseNeon`

TestCRC32.testCRC32Update         64  thrpt   12   16541.230 ±   50.847  ops/ms
TestCRC32.testCRC32Update        128  thrpt   12    8432.515 ±    6.676  ops/ms
TestCRC32.testCRC32Update        256  thrpt   12    4252.450 ±    2.897  ops/ms
TestCRC32.testCRC32Update        384  thrpt   12    2844.691 ±    0.110  ops/ms
TestCRC32.testCRC32Update        511  thrpt   12    2114.317 ±    2.670  ops/ms
TestCRC32.testCRC32Update        512  thrpt   12    2134.668 ±    0.150  ops/ms
TestCRC32.testCRC32Update       1024  thrpt   12    1055.619 ±   27.552  ops/ms
TestCRC32.testCRC32Update       2048  thrpt   12     535.514 ±    0.273  ops/ms
TestCRC32.testCRC32Update       4096  thrpt   12     267.859 ±    0.133  ops/ms
TestCRC32.testCRC32Update       8192  thrpt   12     133.916 ±    0.014  ops/ms
TestCRC32.testCRC32Update      16384  thrpt   12      67.003 ±    0.034  ops/ms
TestCRC32.testCRC32Update      32768  thrpt   12      33.471 ±    0.023  ops/ms
TestCRC32.testCRC32Update      65536  thrpt   12      16.724 ±    0.018  ops/ms
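To compare the two listings above directly, the throughput figures can be divided row by row. The following is a minimal sketch, not part of the benchmark: the class and method names are made up for illustration, and only three of the rows above are copied in.

```java
public class Crc32ThroughputRatio {
    // Ratio of two throughputs measured in the same unit (ops/ms).
    static double ratio(double a, double b) {
        return a / b;
    }

    // Relative change of `measured` against `baseline`, in percent.
    // A negative value means `measured` is slower than `baseline`.
    static double relativeChangePercent(double measured, double baseline) {
        return (measured - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the two JMH listings above (ops/ms).
        int[] sizes = {64, 1024, 65536};
        double[] pmull = {20593.817, 2144.889, 34.789};   // -XX:-UseCRC32 -XX:+UseNeon
        double[] noNeon = {16541.230, 1055.619, 16.724};  // -XX:-UseCRC32 -XX:-UseNeon

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("size %6d: pmull / no-neon = %.2fx, no-neon vs pmull = %.2f%%%n",
                    sizes[i],
                    ratio(pmull[i], noNeon[i]),
                    relativeChangePercent(noNeon[i], pmull[i]));
        }
    }
}
```

For the 65,536 row the ratio comes out at about 2.08, which matches the gap between the -88.97% and -94.70% entries in the table.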



> And do we today need three different hand-coded versions of CRC32?

Each version depends on different processor features. Removing the least efficient versions could degrade existing use cases.
It might still be reasonable to remove them, though. The crc32c intrinsic has only one version, which uses the crc32c instruction.
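As a rough illustration of why the versions coexist, the selection can be pictured as a fallback chain keyed on the two flags used in the measurements above. This is only a sketch under assumptions: the enum, the class, and the exact order are hypothetical and are not taken from the HotSpot sources.

```java
public class Crc32VariantSelector {
    // Hypothetical labels for the three hand-coded versions discussed above.
    enum Variant { CRC32_INSTRUCTIONS, NEON_PMULL, SCALAR_FALLBACK }

    // Assumed order: prefer crc32 instructions, then the NEON pmull path,
    // then the version that needs neither feature.
    static Variant select(boolean useCRC32, boolean useNeon) {
        if (useCRC32) {
            return Variant.CRC32_INSTRUCTIONS;
        }
        if (useNeon) {
            return Variant.NEON_PMULL;
        }
        return Variant.SCALAR_FALLBACK;
    }

    public static void main(String[] args) {
        // The two configurations benchmarked above.
        System.out.println(select(false, true));   // -XX:-UseCRC32 -XX:+UseNeon
        System.out.println(select(false, false));  // -XX:-UseCRC32 -XX:-UseNeon
    }
}
```

Dropping a branch from such a chain means that any configuration relying on it falls through to the next, slower one, which is the degradation risk mentioned above.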

-------------

PR: https://git.openjdk.org/jdk/pull/12480

