RFR: 8299544: Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs [v2]
Andrei Pangin
apangin at openjdk.org
Fri Jan 6 12:07:25 UTC 2023
> When comparing performance of JDK built-in CRC32C intrinsics with the native implementation, we noticed that JDK performs notably worse on smaller array sizes (up to 1024 bytes).
>
> The main reason is that the algorithm based on the ["Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction"](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf) paper, processes input buffer in [chunks](https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/crc32c.h#L29-L50) of 6144, 2064, and 648 bytes, and the rest of the buffer is processed sequentially with [4-byte variant](https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8231-L8236) of CRC32 instruction.
>
> Experiments show that CRC32C computation performance can be improved for smaller inputs without performance impact for larger inputs, if the smallest chunk size is reduced from 648 to 216 bytes. Actually, the original paper suggests the algorithms CRC_216 / CRC_240 for processing blocks of 216 and 240 bytes respectively. The value 648 might have resulted from an "x3" mistake (the algorithm accumulates 3 checksums per iteration).
>
> Additionally, when processing the tail (less than 216 bytes), we can use 8-byte variant of CRC32 instruction instead of 4-byte variant to speed up the algorithm even further.
>
> ### Changes in this PR
>
> 1. CRC32C_LOW is decreased from 8 * 27 to 8 * 9 (which corresponds to 216 bytes).
> 2. CRC32C_MIDDLE is decreased from 8 * 86 to 8 * 74 to compensate for performance drop for mid-size arrays. The value 74 has been chosen empirically. According to experiments, values from 70 to 75 yield the best results, while with higher or lower values, there is a degradation for certain array sizes.
> 3. On x86_64, `CRC32 r32, [m32]` instruction is replaced with `CRC32 r64, [m64]` in the tail loop.
> 4. Unconditional JMP is replaced with a conditional jump at the end of the loop.
> 5. The hot tail loop is aligned.
>
> ### Notes
>
> - The changes are only for x86 architecture.
> - If the hardware supports AVX-512 extensions + VPCLMULQDQ instruction (introduced with Ice Lake), another version of CRC32C intrinsics is generated, and the changes do not have effect.
>
> ### Performance
>
> JMH benchmark used to assess performance: https://cr.openjdk.java.net/~apangin/8299544/CrcBench.java
>
> Tested hardware:
> - Intel Xeon Platinum 8259CL
> - Intel Xeon Platinum 8124M
> - Intel Core i7-1280P
> - AMD EPYC 7251
>
> Raw JMH results:
> https://cr.openjdk.java.net/~apangin/8299544/jmh/
>
> Performance curve:
> <img width="892" alt="crc32c-perf-curve" src="https://cr.openjdk.java.net/~apangin/8299544/crc32c-perf-curve.png">
>
> Improvement summary as a table:
> <img width="282" alt="crc32c-perf-table" src="https://cr.openjdk.java.net/~apangin/8299544/crc32c-perf-table.png">
>
> and as a graph:
> <img width="488" alt="crc32c-perf-graph" src="https://cr.openjdk.java.net/~apangin/8299544/crc32c-perf-graph.png">
>
> I also checked [TestCRC32C.java](https://github.com/openjdk/jdk/blob/872384707e89d03ede655aad16f360dc94f10152/test/micro/org/openjdk/bench/java/util/TestCRC32C.java) microbenchmark included in the JDK source code. It also shows similar improvement:
>
> **Intel Xeon Platinum 8259CL**
>
> ORIGINAL PATCHED
> Benchmark (count) Mode Cnt Score Error Units Score Error Diff
> TestCRC32C.testCRC32CUpdate 64 thrpt 12 61526.044 ± 14.426 ops/ms 116187.797 ± 37.719 88.8%
> TestCRC32C.testCRC32CUpdate 128 thrpt 12 30043.410 ± 275.597 ops/ms 61527.139 ± 11.407 104.8%
> TestCRC32C.testCRC32CUpdate 256 thrpt 12 16086.548 ± 1.386 ops/ms 49801.463 ± 4.034 209.6%
> TestCRC32C.testCRC32CUpdate 512 thrpt 12 8107.780 ± 0.837 ops/ms 25506.806 ± 5.119 214.6%
> TestCRC32C.testCRC32CUpdate 1024 thrpt 12 8171.332 ± 0.684 ops/ms 12913.665 ± 1.425 58.0%
> TestCRC32C.testCRC32CUpdate 2048 thrpt 12 8300.331 ± 1.848 ops/ms 10155.094 ± 1.551 22.3%
> TestCRC32C.testCRC32CUpdate 4096 thrpt 12 4886.483 ± 2.027 ops/ms 5101.460 ± 1.001 4.4%
> TestCRC32C.testCRC32CUpdate 8192 thrpt 12 2690.988 ± 0.315 ops/ms 2859.960 ± 0.807 6.3%
> TestCRC32C.testCRC32CUpdate 16384 thrpt 12 1414.695 ± 0.477 ops/ms 1432.189 ± 0.162 1.2%
> TestCRC32C.testCRC32CUpdate 32768 thrpt 12 671.593 ± 103.208 ops/ms 682.077 ± 106.374 1.6%
> TestCRC32C.testCRC32CUpdate 65536 thrpt 12 340.002 ± 53.363 ops/ms 341.034 ± 53.581 0.3%
>
>
> **AMD EPYC 7251**
>
> ORIGINAL PATCHED
> Benchmark (count) Mode Cnt Score Error Units Score Error Diff
> TestCRC32C.testCRC32CUpdate 64 thrpt 12 32061.964 ± 26.971 ops/ms 53863.073 ± 133.610 68.0%
> TestCRC32C.testCRC32CUpdate 128 thrpt 12 17706.981 ± 19.582 ops/ms 32515.029 ± 14.303 83.6%
> TestCRC32C.testCRC32CUpdate 256 thrpt 12 8455.888 ± 5.070 ops/ms 19181.871 ± 35.279 126.8%
> TestCRC32C.testCRC32CUpdate 512 thrpt 12 4550.874 ± 1.293 ops/ms 10008.634 ± 17.191 119.9%
> TestCRC32C.testCRC32CUpdate 1024 thrpt 12 3636.205 ± 7.306 ops/ms 5122.156 ± 3.912 40.9%
> TestCRC32C.testCRC32CUpdate 2048 thrpt 12 2549.876 ± 90.854 ops/ms 2811.828 ± 5.924 10.3%
> TestCRC32C.testCRC32CUpdate 4096 thrpt 12 1335.905 ± 1.813 ops/ms 1389.812 ± 2.177 4.0%
> TestCRC32C.testCRC32CUpdate 8192 thrpt 12 717.871 ± 7.692 ops/ms 737.720 ± 0.357 2.8%
> TestCRC32C.testCRC32CUpdate 16384 thrpt 12 361.759 ± 1.755 ops/ms 364.348 ± 0.487 0.7%
> TestCRC32C.testCRC32CUpdate 32768 thrpt 12 183.699 ± 0.586 ops/ms 184.822 ± 0.178 0.6%
> TestCRC32C.testCRC32CUpdate 65536 thrpt 12 92.636 ± 0.133 ops/ms 92.850 ± 0.145 0.2%
Andrei Pangin has updated the pull request incrementally with one additional commit since the last revision:
Updated copyright years
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/11838/files
- new: https://git.openjdk.org/jdk/pull/11838/files/9cdca84f..632db572
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=11838&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=11838&range=00-01
Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/11838.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/11838/head:pull/11838
PR: https://git.openjdk.org/jdk/pull/11838
More information about the hotspot-dev
mailing list