RFR: 8358032: Use crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU [v2]

Emanuel Peter epeter at openjdk.org
Wed Jun 4 08:33:20 UTC 2025


On Wed, 4 Jun 2025 03:30:11 GMT, Liming Liu <lliu at openjdk.org> wrote:

>> This PR is to enable the use of crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU. There is an option UseCryptoPmullForCRC32 that can enable crypto pmull, but directly enabling it on Ampere CPU will cause the following problems.
>> 
>> 1. There will be regressions (-14% ~ -8%) on Ampere1 when the length is 64. When <= 128, both kernel_crc32_using_crc32 and kernel_crc32_using_crypto_pmull use the loop labeled as CRC_by32_loop, but their implements are a little different, and the loop in kernel_crc32_using_crc32 is better at hiding latency on Ampere1. So this PR takes the loop in kernel_crc32_using_crc32 to kernel_crc32_using_crypto_pmull, and does the same for CRC32C intrinsic.
>> 
>> 2. The intrinsics only use crypto pmull when the length is higher than 383, while the loop in kernel_crc32_common_fold_using_crypto_pmull looks able to handle 256, and if it handles 256 on Ampere1, the improvements can be as high as 110% compared with kernel_crc32_using_crc32/kernel_crc32c_using_crc32c. However, there are regressions (~-6%) on Neoverse V1 when the length is 256. So this PR introduces a new option named CryptoPmullForCRC32LowLimit. It defaults to 256 since the code could handle 256, while it is set to 384 for V1/V2 to keep the old behavior on these platforms.
>> 
>> The performance regressions and improvements were measured with the following microbenchmarks:
>> org.openjdk.bench.java.util.TestCRC32.testCRC32Update
>> org.openjdk.bench.java.util.TestCRC32C.testCRC32CUpdate
>> 
>> Ran the following JTReg tests on Ampere1 and did not find problems:
>> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
>> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java
>
> Liming Liu has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Make it be a diagnostic flag

@limingliu-ampere Thanks for working on this! 😊 

Generally looks reasonable  to me as a non expert in crypto intrinsics. But we definitively need an expert to approve this in the end. I have a few comments below.

Also: it would be nice to have a sanity test where you use that new flag. It could also be an additional run in an existing test (that's probably even better). You may want to run it with a few different values, including non-multiple of `128` just to sanity check the alignment correction as well. I don't know how much runtime that would add, so that should be checked before going too crazy. Having different values for the flag helps us to simulate the behavior of other hardware for example, and that can be quite useful in general. What do you think?

src/hotspot/cpu/aarch64/globals_aarch64.hpp line 95:

> 93:           "Minimum size in bytes when Crypto PMULL will be used."       \
> 94:           "Value must be a multiple of 128.")                           \
> 95:           range(256, max_jint)                                          \

Is it sane to have negative values? If not, use `uintx`... or maybe even just `uint`?

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 4335:

> 4333:     assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);
> 4334: 
> 4335:     subs(tmp0, len, CryptoPmullForCRC32LowLimit);

Would it make sense to have another alignment sanity check here? It would be both helpful to make sure nobody later breaks your assumption, and could also be helpful for the reader to see the `128` alignment immediately.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25609#pullrequestreview-2895780805
PR Review Comment: https://git.openjdk.org/jdk/pull/25609#discussion_r2125999298
PR Review Comment: https://git.openjdk.org/jdk/pull/25609#discussion_r2126003055


More information about the hotspot-dev mailing list