RFR JDK-6321472: Add CRC-32C API
Staffan Friberg
staffan.friberg at oracle.com
Mon Oct 20 20:59:31 UTC 2014
Hi David,
>> The intrinsic that can be implemented for CRC32C will probably be even faster, as it can make use of the specific crc32c instruction instead of carryless multiplication.
> It might not hurt to mock up a quick benchmark using the CRC32C
> instruction against jdk8’s java.util.zip.CRC32 intrinsic. CRC32
> seems to consume 64 bits per instruction, and I think each instruction
> depends on its predecessor (naively coded, at least; there’s probably
> a hack to run them in parallel, just as there is a hack that allows fork/join.
> So maybe we should think about that hack, too; simply computing the
> two half-CRCs in parallel, then combining them as if fork/join, might
> break the data dependence and allow higher speed).
>
> The carryless multiply inner loop also consumes 64 bits per carryless
> multiply, but the instructions are not data dependent among themselves,
> so it might go faster. Haswell chips execute clmul especially quickly.
>
> The assembly language I wrote to make that go fast is agnostic about P;
> it is parameterized by a structure that contains a small number of constants
> that make it all go (6 64-bit numbers) based on powers of x (regarded as a
> polynomial) modulo the CRC polynomial P. If the endianness conventions
> match, and if zip CRC runs faster than the CRC32 instruction, I could
> revive it, and modern tools might let us avoid coding in hex.
Yes, when an intrinsic is added it will definitely need to be compared
to the current crc32. The calculation speed of the two should at least
be on par, and hopefully we will be able to get crc32c even faster.
Here is a paper describing an algorithm that uses the Intel crc32
instruction, http://dl.acm.org/citation.cfm?id=2108885. It exploits the
instruction's 3-cycle latency by keeping 3 independent crc32 chains in
flight over separate parts of the buffer and then merging the partial
results. (Presumably it uses a merging algorithm similar to the one in
your fork/join implementation.)
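
For reference, a minimal JMH sketch of such a comparison could look
like the one below. It assumes the java.util.zip.CRC32C class proposed
in this RFR (implementing the same Checksum interface as CRC32); the
benchmark class name and the parameter sizes are made up for
illustration.

import java.util.concurrent.ThreadLocalRandom;
import java.util.zip.CRC32;
import java.util.zip.CRC32C; // the class proposed in this RFR

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ChecksumBench {

    // Small, medium and large inputs to see where an intrinsic pays off.
    @Param({"64", "4096", "1048576"})
    int size;

    byte[] data;

    @Setup
    public void setup() {
        data = new byte[size];
        ThreadLocalRandom.current().nextBytes(data);
    }

    @Benchmark
    public long crc32() {
        CRC32 crc = new CRC32(); // clmul-intrinsified since JDK 8
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    @Benchmark
    public long crc32c() {
        CRC32C crc = new CRC32C(); // would exercise the crc32c instruction once intrinsified
        crc.update(data, 0, data.length);
        return crc.getValue();
    }
}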
>> I have also been thinking about using fork/join to help with large arrays. I decided not to go after it for this first implementation of the new API. The other question is whether that should be something more explicit, similar to parallel streams, as not everyone might want/expect the API to suddenly start using multiple threads to calculate the result.
> That’s why we have a system fork/join pool; if you don’t want lots of concurrency,
> limit the number of threads in that pool. Soon enough people are going to
> yell at us for not using concurrency when it is available, and forcing them to
> use a different API just to get concurrency will hinder the ability to write portable
> code.
We definitely should try to do a parallel version that can further
speed up checksums over very large arrays/buffers, but we probably need
to discuss further and get input on how it should be enabled. Do we
need to add an optional parallel method, or do we simply make the
current method parallel? I believe that is what needs to be discussed
and decided at a more general level, as we might want to parallelize
other APIs as well where possible, and preferably we should follow the
same principle across the whole Java standard library.
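
To make that discussion concrete, below is a rough, hypothetical sketch
of what a fork/join version could look like (class name, threshold and
structure are all made up for illustration). The merge step is a Java
port of zlib's crc32_combine, which computes crc(A++B) from crc(A),
crc(B) and len(B) by applying a GF(2) "zeros operator" matrix; the same
port works for CRC-32C by swapping in its reflected polynomial
0x82F63B78.

import java.util.concurrent.RecursiveTask;
import java.util.zip.CRC32;

// Hypothetical sketch: checksum the two halves in the common pool,
// then merge the partial CRCs (zlib's crc32_combine ported to Java).
public class ParallelCrc32 extends RecursiveTask<Long> {

    private static final long POLY = 0xEDB88320L; // reflected CRC-32; 0x82F63B78L for CRC-32C
    private static final int THRESHOLD = 1 << 20; // sequential below this; tune by benchmark

    private final byte[] buf;
    private final int off, len;

    public ParallelCrc32(byte[] buf, int off, int len) {
        this.buf = buf; this.off = off; this.len = len;
    }

    @Override
    protected Long compute() {
        if (len <= THRESHOLD) {
            CRC32 crc = new CRC32();
            crc.update(buf, off, len);
            return crc.getValue();
        }
        int half = len / 2;
        ParallelCrc32 right = new ParallelCrc32(buf, off + half, len - half);
        right.fork();                                      // second half in the pool
        long crc1 = new ParallelCrc32(buf, off, half).compute(); // first half here
        long crc2 = right.join();
        return combine(crc1, crc2, len - half);
    }

    // crc(A ++ B) from crc(A), crc(B) and |B|, via repeated matrix squaring.
    static long combine(long crc1, long crc2, long len2) {
        if (len2 <= 0) return crc1;
        long[] odd = new long[32];
        long[] even = new long[32];
        odd[0] = POLY;                        // operator for one zero bit
        long row = 1;
        for (int n = 1; n < 32; n++) { odd[n] = row; row <<= 1; }
        square(even, odd);                    // two zero bits
        square(odd, even);                    // four zero bits
        do {
            square(even, odd);                // first pass: one zero byte
            if ((len2 & 1) != 0) crc1 = times(even, crc1);
            len2 >>>= 1;
            if (len2 == 0) break;
            square(odd, even);
            if ((len2 & 1) != 0) crc1 = times(odd, crc1);
            len2 >>>= 1;
        } while (len2 != 0);
        return crc1 ^ crc2;
    }

    private static long times(long[] mat, long vec) {
        long sum = 0;
        for (int i = 0; vec != 0; vec >>>= 1, i++)
            if ((vec & 1) != 0) sum ^= mat[i];
        return sum;
    }

    private static void square(long[] sq, long[] mat) {
        for (int i = 0; i < 32; i++)
            sq[i] = times(mat, mat[i]);
    }
}

Calling it would look like
ForkJoinPool.commonPool().invoke(new ParallelCrc32(data, 0, data.length)),
so limiting the common pool limits the concurrency, as you point out
above.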
//Staffan