RFR: 8299544: Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs [v2]

Andrei Pangin apangin at openjdk.org
Fri Jan 6 12:07:25 UTC 2023


> When comparing performance of JDK built-in CRC32C intrinsics with the native implementation, we noticed that JDK performs notably worse on smaller array sizes (up to 1024 bytes).
> 
> The main reason is that the algorithm based on the ["Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction"](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf) paper, processes input buffer in [chunks](https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/crc32c.h#L29-L50) of 6144, 2064, and 648 bytes, and the rest of the buffer is processed sequentially with [4-byte variant](https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8231-L8236) of CRC32 instruction.
> 
> Experiments show that CRC32C computation performance can be improved for smaller inputs without performance impact for larger inputs, if the smallest chunk size is reduced from 648 to 216 bytes. Actually, the original paper suggests the algorithms CRC_216 / CRC_240 for processing blocks of 216 and 240 bytes respectively. The value 648 might have resulted from an "x3" mistake (the algorithm accumulates 3 checksums per iteration).
> 
> Additionally, when processing the tail (less than 216 bytes), we can use 8-byte variant of CRC32 instruction instead of 4-byte variant to speed up the algorithm even further.
> 
> ### Changes in this PR
> 
> 1. CRC32C_LOW is decreased from 8 * 27 to 8 * 9 (which corresponds to 216 bytes).
> 2. CRC32C_MIDDLE is decreased from 8 * 86 to 8 * 74 to compensate for performance drop for mid-size arrays. The value 74 has been chosen empirically. According to experiments, values from 70 to 75 yield the best results, while with higher or lower values, there is a degradation for certain array sizes.
> 3. On x86_64, `CRC32 r32, [m32]` instruction is replaced with `CRC32 r64, [m64]` in the tail loop.
> 4. Unconditional JMP is replaced with a conditional jump at the end of the loop.
> 5. The hot tail loop is aligned.
> 
> ### Notes
> 
> - The changes are only for x86 architecture.
> - If the hardware supports AVX-512 extensions + VPCLMULQDQ instruction (introduced with Ice Lake), another version of CRC32C intrinsics is generated, and the changes do not have effect.
> 
> ### Performance
> 
> JMH benchmark used to assess performance: https://cr.openjdk.java.net/~apangin/8299544/CrcBench.java
> 
> Tested hardware:
>  - Intel Xeon Platinum 8259CL
>  - Intel Xeon Platinum 8124M
>  - Intel Core i7-1280P
>  - AMD EPYC 7251
> 
> Raw JMH results:
> https://cr.openjdk.java.net/~apangin/8299544/jmh/
> 
> Performance curve:
> <img width="892" alt="crc32c-perf-curve" src="https://cr.openjdk.java.net/~apangin/8299544/crc32c-perf-curve.png">
> 
> Improvement summary as a table:
> <img width="282" alt="crc32c-perf-table" src="https://cr.openjdk.java.net/~apangin/8299544/crc32c-perf-table.png">
> 
> and as a graph:
> <img width="488" alt="crc32c-perf-graph" src="https://cr.openjdk.java.net/~apangin/8299544/crc32c-perf-graph.png">
> 
> I also checked [TestCRC32C.java](https://github.com/openjdk/jdk/blob/872384707e89d03ede655aad16f360dc94f10152/test/micro/org/openjdk/bench/java/util/TestCRC32C.java) microbenchmark included in the JDK source code. It also shows similar improvement:
> 
> **Intel Xeon Platinum 8259CL**
> 
>                                                    ORIGINAL                         PATCHED
> Benchmark                    (count)   Mode  Cnt      Score     Error   Units         Score     Error      Diff
> TestCRC32C.testCRC32CUpdate       64  thrpt   12  61526.044 ±  14.426  ops/ms    116187.797 ±  37.719     88.8% 
> TestCRC32C.testCRC32CUpdate      128  thrpt   12  30043.410 ± 275.597  ops/ms     61527.139 ±  11.407    104.8%
> TestCRC32C.testCRC32CUpdate      256  thrpt   12  16086.548 ±   1.386  ops/ms     49801.463 ±   4.034    209.6%
> TestCRC32C.testCRC32CUpdate      512  thrpt   12   8107.780 ±   0.837  ops/ms     25506.806 ±   5.119    214.6%
> TestCRC32C.testCRC32CUpdate     1024  thrpt   12   8171.332 ±   0.684  ops/ms     12913.665 ±   1.425     58.0% 
> TestCRC32C.testCRC32CUpdate     2048  thrpt   12   8300.331 ±   1.848  ops/ms     10155.094 ±   1.551     22.3% 
> TestCRC32C.testCRC32CUpdate     4096  thrpt   12   4886.483 ±   2.027  ops/ms      5101.460 ±   1.001      4.4%  
> TestCRC32C.testCRC32CUpdate     8192  thrpt   12   2690.988 ±   0.315  ops/ms      2859.960 ±   0.807      6.3%  
> TestCRC32C.testCRC32CUpdate    16384  thrpt   12   1414.695 ±   0.477  ops/ms      1432.189 ±   0.162      1.2%  
> TestCRC32C.testCRC32CUpdate    32768  thrpt   12    671.593 ± 103.208  ops/ms       682.077 ± 106.374      1.6%  
> TestCRC32C.testCRC32CUpdate    65536  thrpt   12    340.002 ±  53.363  ops/ms       341.034 ±  53.581      0.3%  
> 
> 
> **AMD EPYC 7251**
> 
>                                                    ORIGINAL                       PATCHED
> Benchmark                    (count)   Mode  Cnt      Score    Error   Units        Score     Error      Diff
> TestCRC32C.testCRC32CUpdate       64  thrpt   12  32061.964 ± 26.971  ops/ms    53863.073 ± 133.610     68.0%
> TestCRC32C.testCRC32CUpdate      128  thrpt   12  17706.981 ± 19.582  ops/ms    32515.029 ±  14.303     83.6%
> TestCRC32C.testCRC32CUpdate      256  thrpt   12   8455.888 ±  5.070  ops/ms    19181.871 ±  35.279    126.8%
> TestCRC32C.testCRC32CUpdate      512  thrpt   12   4550.874 ±  1.293  ops/ms    10008.634 ±  17.191    119.9%
> TestCRC32C.testCRC32CUpdate     1024  thrpt   12   3636.205 ±  7.306  ops/ms     5122.156 ±   3.912     40.9%
> TestCRC32C.testCRC32CUpdate     2048  thrpt   12   2549.876 ± 90.854  ops/ms     2811.828 ±   5.924     10.3%
> TestCRC32C.testCRC32CUpdate     4096  thrpt   12   1335.905 ±  1.813  ops/ms     1389.812 ±   2.177      4.0%
> TestCRC32C.testCRC32CUpdate     8192  thrpt   12    717.871 ±  7.692  ops/ms      737.720 ±   0.357      2.8%
> TestCRC32C.testCRC32CUpdate    16384  thrpt   12    361.759 ±  1.755  ops/ms      364.348 ±   0.487      0.7%
> TestCRC32C.testCRC32CUpdate    32768  thrpt   12    183.699 ±  0.586  ops/ms      184.822 ±   0.178      0.6%
> TestCRC32C.testCRC32CUpdate    65536  thrpt   12     92.636 ±  0.133  ops/ms       92.850 ±   0.145      0.2%

Andrei Pangin has updated the pull request incrementally with one additional commit since the last revision:

  Updated copyright years

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11838/files
  - new: https://git.openjdk.org/jdk/pull/11838/files/9cdca84f..632db572

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11838&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11838&range=00-01

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/11838.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11838/head:pull/11838

PR: https://git.openjdk.org/jdk/pull/11838


More information about the hotspot-dev mailing list