RFR: 8302783: Improve CRC32C intrinsic with crypto pmull on AArch64

Paul Hohensee phh at openjdk.org
Thu Mar 2 22:26:11 UTC 2023


On Fri, 17 Feb 2023 19:59:24 GMT, Yi-Fan Tsai <duke at openjdk.org> wrote:

> This change adds a pmull-based CRC32C intrinsic that outperforms the existing crc32c-instruction-based intrinsic on Neoverse V1. The benchmark shows a 10 - 99% improvement. The improvement comes from the increase in pmull/pmull2 execution throughput from 1 on Neoverse N1 to 4 on Neoverse V1, with the latency remaining at 2, while the throughput of the CRC32C instructions did not change.
> 
> The pmull-based CRC32C intrinsic is enabled by the existing option UseCryptoPmullForCRC32 which also enables the pmull-based CRC32 intrinsic. The option requires crc32c instructions, eor3 in SHA3, and 64-bit pmull/pmull2 in Cryptographic Extension.
> 
> With this change, there will be only two different CRC32C intrinsics, crc32c and pmull, while there are four CRC32 intrinsics.
> 
> The following test has passed.
> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java
> 
> The throughput reported by [the micro benchmark](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/util/TestCRC32C.java) is measured on an EC2 c7g instance. The optimization shows 10 - 99% improvement when the input is at least 384 bytes.
> 
> | input (bytes) | 64         | 128        | 256        | 384        | 511        | 512        | 1,024      |
> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
> | improvement   | 1.60%      | 0.00%      | 0.00%      | 15.24%     | 10.76%     | 34.32%     | 72.39%     |
> 
> | input (bytes) | 2,048      | 4,096      | 8,192      | 16,384     | 32,768     | 65,536     |
> | ------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
> | improvement   | 84.96%     | 92.59%     | 96.19%     | 98.02%     | 99.32%     | 98.36%     |
> 
> 
> Baseline
> 
> Benchmark                    (count)   Mode  Cnt       Score      Error   Units
> TestCRC32C.testCRC32CUpdate       64  thrpt   12  196575.739 ± 1824.113  ops/ms
> TestCRC32C.testCRC32CUpdate      128  thrpt   12  123666.570 ±    2.730  ops/ms
> TestCRC32C.testCRC32CUpdate      256  thrpt   12   70188.989 ±    2.002  ops/ms
> TestCRC32C.testCRC32CUpdate      384  thrpt   12   49000.690 ±    1.421  ops/ms
> TestCRC32C.testCRC32CUpdate      511  thrpt   12   34106.279 ±   25.390  ops/ms
> TestCRC32C.testCRC32CUpdate      512  thrpt   12   37638.349 ±    1.039  ops/ms
> TestCRC32C.testCRC32CUpdate     1024  thrpt   12   19526.513 ±    0.439  ops/ms
> TestCRC32C.testCRC32CUpdate     2048  thrpt   12    9951.392 ±    4.803  ops/ms
> TestCRC32C.testCRC32CUpdate     4096  thrpt   12    5023.268 ±    0.240  ops/ms
> TestCRC32C.testCRC32CUpdate     8192  thrpt   12    2523.877 ±    0.062  ops/ms
> TestCRC32C.testCRC32CUpdate    16384  thrpt   12    1265.011 ±    0.047  ops/ms
> TestCRC32C.testCRC32CUpdate    32768  thrpt   12     632.291 ±    0.058  ops/ms
> TestCRC32C.testCRC32CUpdate    65536  thrpt   12     315.396 ±    0.160  ops/ms
> 
> 
> Crypto pmull
> 
> Benchmark                    (count)   Mode  Cnt       Score     Error   Units
> TestCRC32C.testCRC32CUpdate       64  thrpt   12  199726.599 ± 166.477  ops/ms
> TestCRC32C.testCRC32CUpdate      128  thrpt   12  123669.385 ±   1.821  ops/ms
> TestCRC32C.testCRC32CUpdate      256  thrpt   12   70188.727 ±   1.313  ops/ms
> TestCRC32C.testCRC32CUpdate      384  thrpt   12   56468.837 ±  76.524  ops/ms
> TestCRC32C.testCRC32CUpdate      511  thrpt   12   37777.205 ± 406.431  ops/ms
> TestCRC32C.testCRC32CUpdate      512  thrpt   12   50554.555 ±  17.169  ops/ms
> TestCRC32C.testCRC32CUpdate     1024  thrpt   12   33661.006 ± 140.471  ops/ms
> TestCRC32C.testCRC32CUpdate     2048  thrpt   12   18406.482 ± 205.952  ops/ms
> TestCRC32C.testCRC32CUpdate     4096  thrpt   12    9674.159 ±  20.390  ops/ms
> TestCRC32C.testCRC32CUpdate     8192  thrpt   12    4951.562 ±   6.566  ops/ms
> TestCRC32C.testCRC32CUpdate    16384  thrpt   12    2504.970 ±   1.883  ops/ms
> TestCRC32C.testCRC32CUpdate    32768  thrpt   12    1260.278 ±   0.484  ops/ms
> TestCRC32C.testCRC32CUpdate    65536  thrpt   12     625.608 ±   0.300  ops/ms
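
For reference, the improvement rows in the tables above appear to be the ratio of the "Crypto pmull" score to the "Baseline" score, minus one. A minimal sketch that recomputes a few of them from the quoted JMH scores (the class and method names are hypothetical and not part of the patch):

```java
public class Crc32cImprovement {
    // Relative throughput gain of the patched score over the baseline, in percent.
    static double improvementPercent(double baseline, double patched) {
        return (patched / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms, copied from the two JMH listings quoted above.
        int[] sizes = {64, 384, 512, 1024, 65536};
        double[] baseline = {196575.739, 49000.690, 37638.349, 19526.513, 315.396};
        double[] pmull = {199726.599, 56468.837, 50554.555, 33661.006, 625.608};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%6d bytes: %.2f%%%n",
                    sizes[i], improvementPercent(baseline[i], pmull[i]));
        }
    }
}
```

The recomputed values agree with the table to two decimal places (for example, 34.32% at 512 bytes), so the table looks consistent with the raw scores.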

Lgtm.
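
For anyone who wants a quick local sanity check in addition to the jtreg test, here is a minimal sketch that compares java.util.zip.CRC32C against a bit-at-a-time reference over the input sizes used in the benchmark. The class and helper names are hypothetical. Running it with -XX:+UseCryptoPmullForCRC32 on an AArch64 build is assumed to select the new path, but only once the update method has been JIT-compiled, so the repeated loop is a heuristic rather than a guarantee that the intrinsic was exercised.

```java
import java.util.Random;
import java.util.zip.CRC32C;

public class Crc32cCheck {
    // Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    static long reference(byte[] data) {
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc ^= (b & 0xFF);
            for (int k = 0; k < 8; k++) {
                // Shift right; XOR in the polynomial when the dropped bit was 1.
                crc = (crc >>> 1) ^ (0x82F63B78 & -(crc & 1));
            }
        }
        return (~crc) & 0xFFFFFFFFL;
    }

    // Checksum via the JDK class that the intrinsic accelerates.
    static long library(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        int[] sizes = {64, 128, 256, 384, 511, 512, 1024, 4096, 65536};
        Random random = new Random(42);
        // Repeat so the JIT has a chance to compile the update path.
        for (int round = 0; round < 2_000; round++) {
            for (int size : sizes) {
                byte[] data = new byte[size];
                random.nextBytes(data);
                if (reference(data) != library(data)) {
                    throw new AssertionError("mismatch at size " + size);
                }
            }
        }
        System.out.println("all sizes match");
    }
}
```

The sizes include 511 and 512 on purpose, since the benchmark above shows a tail-handling difference between them.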

The linux-x86 pre-submit test failure is caused by a test using -XX:+UseCompressedClassPointers, which is an invalid switch for 32-bit JVMs.

The linux-cross-compile pre-submit test failure is a compile-time failure in src/hotspot/cpu/arm/interpreterRT_arm.cpp, a file that this patch does not touch.

-------------

Marked as reviewed by phh (Reviewer).

PR: https://git.openjdk.org/jdk/pull/12624

