RFR: 8302113: Improve CRC32 intrinsic with crypto pmull on AArch64

Andrew Haley aph at openjdk.org
Sun Feb 12 14:54:31 UTC 2023


On Thu, 9 Feb 2023 02:25:27 GMT, Yi-Fan Tsai <duke at openjdk.org> wrote:

> The pmull and pmull2 instructions support operating on 64-bit data when the Cryptographic Extension is present. The execution throughput of this form rises from 1 on Neoverse N1 to 4 on Neoverse V1 while the latency remains 2. The CRC32 instructions did not change: latency 2, throughput 1. As a result, computing CRC32 using pmull could perform better than using the crc32 instructions.
> 
> The following test has passed.
> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
> 
> The throughput reported by the microbenchmark was measured on an EC2 c7g instance. The optimization shows an 11-99% improvement when the input is at least 384 bytes.
> 
> | input (bytes) | 64         | 128        | 256        | 384        | 511        | 512        | 1,024      |
> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
> |  improvement  | 0.02%      | 0.02%      | 0.00%      | 16.00%     | 11.94%     | 34.75%     | 69.80%     |
> 
> | input (bytes) | 2,048      | 4,096      | 8,192      | 16,384     | 32,768     | 65,536     |
> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
> |  improvement  | 77.61%     | 92.33%     | 95.98%     | 97.95%     | 99.33%     | 98.36%     |
> 
> 
> Baseline
> 
> TestCRC32.testCRC32Update         64  thrpt   12  173126.358 ± 118.330  ops/ms
> TestCRC32.testCRC32Update        128  thrpt   12  112910.118 ±  47.305  ops/ms
> TestCRC32.testCRC32Update        256  thrpt   12   66601.990 ±   7.294  ops/ms
> TestCRC32.testCRC32Update        384  thrpt   12   47229.319 ±   3.949  ops/ms
> TestCRC32.testCRC32Update        511  thrpt   12   33733.119 ±   4.076  ops/ms
> TestCRC32.testCRC32Update        512  thrpt   12   36584.565 ±   4.211  ops/ms
> TestCRC32.testCRC32Update       1024  thrpt   12   19239.083 ±   1.040  ops/ms
> TestCRC32.testCRC32Update       2048  thrpt   12    9875.652 ±   0.435  ops/ms
> TestCRC32.testCRC32Update       4096  thrpt   12    5004.425 ±   0.290  ops/ms
> TestCRC32.testCRC32Update       8192  thrpt   12    2519.185 ±   0.169  ops/ms
> TestCRC32.testCRC32Update      16384  thrpt   12    1263.909 ±   0.194  ops/ms
> TestCRC32.testCRC32Update      32768  thrpt   12     632.018 ±   0.053  ops/ms
> TestCRC32.testCRC32Update      65536  thrpt   12     315.471 ±   0.095  ops/ms
> 
> 
> Crypto pmull
> 
> TestCRC32.testCRC32Update         64  thrpt   12  173168.669 ±   4.746  ops/ms
> TestCRC32.testCRC32Update        128  thrpt   12  112933.519 ±   4.583  ops/ms
> TestCRC32.testCRC32Update        256  thrpt   12   66602.462 ±   3.150  ops/ms
> TestCRC32.testCRC32Update        384  thrpt   12   54784.739 ±   2.110  ops/ms
> TestCRC32.testCRC32Update        511  thrpt   12   37760.816 ±  69.911  ops/ms
> TestCRC32.testCRC32Update        512  thrpt   12   49297.609 ±  21.983  ops/ms
> TestCRC32.testCRC32Update       1024  thrpt   12   32667.507 ±  90.610  ops/ms
> TestCRC32.testCRC32Update       2048  thrpt   12   17539.986 ± 511.416  ops/ms
> TestCRC32.testCRC32Update       4096  thrpt   12    9625.249 ±   9.713  ops/ms
> TestCRC32.testCRC32Update       8192  thrpt   12    4937.135 ±   6.121  ops/ms
> TestCRC32.testCRC32Update      16384  thrpt   12    2501.936 ±   1.270  ops/ms
> TestCRC32.testCRC32Update      32768  thrpt   12    1259.831 ±   0.119  ops/ms
> TestCRC32.testCRC32Update      65536  thrpt   12     625.773 ±   0.242  ops/ms
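
For reference, the improvement row in the tables can be reproduced from the two result sets quoted above as (pmull throughput / baseline throughput - 1). A minimal sketch follows; the class and method names are made up for illustration and are not part of the PR or the benchmark, and only the numbers come from the quoted JMH output.

```java
public class Crc32Improvement {
    // Input sizes in bytes, in the order the benchmark reports them.
    static final int[] SIZES = {
        64, 128, 256, 384, 511, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
    };

    // Throughput in ops/ms, copied from the "Baseline" run.
    static final double[] BASELINE = {
        173126.358, 112910.118, 66601.990, 47229.319, 33733.119, 36584.565,
        19239.083, 9875.652, 5004.425, 2519.185, 1263.909, 632.018, 315.471
    };

    // Throughput in ops/ms, copied from the "Crypto pmull" run.
    static final double[] PMULL = {
        173168.669, 112933.519, 66602.462, 54784.739, 37760.816, 49297.609,
        32667.507, 17539.986, 9625.249, 4937.135, 2501.936, 1259.831, 625.773
    };

    // Relative throughput gain of candidate over baseline, in percent.
    static double improvementPercent(double baseline, double candidate) {
        return (candidate / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            System.out.printf("%6d  %6.2f%%%n",
                    SIZES[i], improvementPercent(BASELINE[i], PMULL[i]));
        }
    }
}
```

The error bars are ignored here, so the small-input rows (64 to 256 bytes) are best read as "no measurable change" rather than as real gains.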

The CRC instructions, optional in ARMv8.0, become a requirement in ARMv8.1. ARMv8.0 covers cores like Cortex-A53 but also, apparently, Cortex-A72. So I guess we're stuck with all three, at least for now. This is a bad business, but I guess it's something that Arm & partners have dropped on us, and there's little we can do about it.

-------------

PR: https://git.openjdk.org/jdk/pull/12480

