RFR: JDK-8300584: Accelerate AVX-512 CRC32C for small buffers

Scott Gibbons duke at openjdk.org
Wed Jan 18 22:19:24 UTC 2023


On Wed, 18 Jan 2023 22:02:03 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Use AVX2 code for CRC32C for small buffers in the AVX-512 path.  Breakeven buffer size between the two algorithms is on the order of 384 bytes.
>> 
>> **Performance numbers for previous:**
>> 
>> Benchmark                    (count)   Mode  Cnt      Score     Error   Units
>> TestCRC32C.testCRC32CUpdate       64  thrpt    4  66974.957 ±   8.872  ops/ms
>> TestCRC32C.testCRC32CUpdate      128  thrpt    4  44224.810 ±  11.801  ops/ms
>> TestCRC32C.testCRC32CUpdate      256  thrpt    4  63997.611 ± 173.577  ops/ms
>> TestCRC32C.testCRC32CUpdate      512  thrpt    4  56068.683 ± 269.582  ops/ms
>> TestCRC32C.testCRC32CUpdate     2048  thrpt    4  27048.098 ±  87.350  ops/ms
>> TestCRC32C.testCRC32CUpdate    16384  thrpt    4   4066.736 ±  10.318  ops/ms
>> TestCRC32C.testCRC32CUpdate    65536  thrpt    4   1040.754 ±   6.419  ops/ms
>> 
>> 
>> **Performance numbers for this version:**
>> 
>> Benchmark                    (count)   Mode  Cnt       Score     Error   Units
>> TestCRC32C.testCRC32CUpdate       64  thrpt    3  161659.326 ±  74.974  ops/ms
>> TestCRC32C.testCRC32CUpdate      128  thrpt    3   88456.935 ±  11.940  ops/ms
>> TestCRC32C.testCRC32CUpdate      256  thrpt    3   73254.993 ±   5.004  ops/ms
>> TestCRC32C.testCRC32CUpdate      512  thrpt    3   56508.541 ± 298.229  ops/ms
>> TestCRC32C.testCRC32CUpdate     2048  thrpt    3   26701.995 ±  31.369  ops/ms
>> TestCRC32C.testCRC32CUpdate    16384  thrpt    3    4110.819 ±   4.618  ops/ms
>> TestCRC32C.testCRC32CUpdate    65536  thrpt    3    1045.821 ±   2.037  ops/ms
>
> We should avoid duplicating code. Since `crc32c_ipl_alg2_alt2` is used in both cases, we can reshape the code as follows (pseudo code):
> 
> 
>   if (supports_avx512()) {
>     if (len > 384) {
>       kernel_crc32_avx512();
>       jmp Exit;
>     }
>   }
>   crc32c_ipl_alg2_alt2();
> Exit:
> ...

@vnkozlov - There is really no code duplication here.  This is code that generates code, so the `if (supports_avx512())` construct selects, at code-generation time, which code is generated for the intrinsic.  That is, the original code generates either the AVX-512 kernel -OR- the AVX2 routine, never both.  I need *both* generated for this fix.  In the original code there is no fall-through, so on an AVX-512 platform the `crc32c_ipl_alg2_alt2()` block can't be jumped to (because it's never generated).  When the platform does not support AVX-512, only `crc32c_ipl_alg2_alt2()` is generated.

In any case, the check on `len` is a *runtime* check, not a code-generation check.  Does this make sense?
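
To illustrate the distinction, here is a minimal, self-contained sketch.  It is not the actual patch and does not use HotSpot's assembler API: `ToyAssembler`, `generate_original`, `generate_fixed`, `contains` and the textual "instructions" are all hypothetical stand-ins.  The C++ `if` runs once, when the stub is generated, while the comparison against `len` is itself emitted and only executes when the stub is called.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for an assembler: "emitting" an instruction appends a line of
// text. Hypothetical; this is not HotSpot's MacroAssembler.
struct ToyAssembler {
    std::vector<std::string> code;
    void emit(const std::string& insn) { code.push_back(insn); }
};

// Breakeven size quoted in the PR description.
constexpr std::size_t kThreshold = 384;

// Original shape: the C++ `if` is evaluated at code-generation time, so
// exactly one of the two routines ends up in the generated stub.
std::vector<std::string> generate_original(bool supports_avx512) {
    ToyAssembler masm;
    if (supports_avx512) {
        masm.emit("call kernel_crc32_avx512");
    } else {
        masm.emit("call crc32c_ipl_alg2_alt2");
    }
    masm.emit("ret");
    return masm.code;
}

// One possible shape in which both routines are generated on an AVX-512
// platform. The compare and branch on `len` are emitted instructions, so they
// run every time the stub is called, not when it is generated.
std::vector<std::string> generate_fixed(bool supports_avx512) {
    ToyAssembler masm;
    if (supports_avx512) {                       // code-generation-time check
        masm.emit("cmp len, " + std::to_string(kThreshold));
        masm.emit("jbe small");                  // runtime check on len
        masm.emit("call kernel_crc32_avx512");
        masm.emit("jmp exit");
        masm.emit("small:");
    }
    masm.emit("call crc32c_ipl_alg2_alt2");
    masm.emit("exit:");
    masm.emit("ret");
    return masm.code;
}

// Helper: does the generated stub contain this exact instruction?
bool contains(const std::vector<std::string>& code, const std::string& insn) {
    return std::find(code.begin(), code.end(), insn) != code.end();
}
```

With `supports_avx512 == true`, the stub from `generate_original` has no `crc32c_ipl_alg2_alt2` call to jump to, whereas the stub from `generate_fixed` contains both routines plus the runtime compare on `len`.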

-------------

PR: https://git.openjdk.org/jdk/pull/12079


More information about the hotspot-dev mailing list