RFR JDK-6321472: Add CRC-32C API

David Chase david.r.chase at oracle.com
Sat Oct 18 02:52:55 UTC 2014


On 2014-10-17, at 5:22 PM, Staffan Friberg <staffan.friberg at oracle.com> wrote:
> 
> The polynomial used for multiplication is different, so while, as with all CRCs, the algorithm is similar, a different polynomial is used and the final checksum will be different.

Yes, I was misled by the comments.  Wikipedia makes all clear.
I thought maybe the polynomial itself was just bitflipped and the
whole thing was computed with a different endianness, depending
on whether you shift off the top or the bottom.

> The intrinsic that can be implemented for CRC32C will probably be even faster, as it can make use of the specific crc32c instruction instead of carry-less multiplication.

It might not hurt to mock up a quick benchmark pitting the CRC32C
instruction against jdk8’s java.util.zip.CRC32 intrinsic.  CRC32
seems to consume 64 bits per instruction, and I think each instruction
depends on its predecessor (naively coded, at least; there’s probably
a hack to run them in parallel, just as there is a hack that makes
fork/join possible.  So maybe we should think about that hack, too:
simply computing the two half-CRCs in parallel, then combining them
as fork/join would, might break the data dependence and allow higher
speed).
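For the combining step, zlib already has the needed trick (its crc32_combine
function): treat “append N zero bytes” as a linear operator over GF(2) and
apply its matrix power to the first half’s CRC.  A rough Java transcription
of that algorithm (class and method names are mine, not anything in the JDK)
might look like:

```java
import java.util.zip.CRC32;

public class Crc32Combine {
    // Multiply a GF(2) 32x32 matrix (stored as 32 column vectors) by a vector.
    static int gf2MatrixTimes(int[] mat, int vec) {
        int sum = 0;
        for (int i = 0; vec != 0; i++, vec >>>= 1)
            if ((vec & 1) != 0) sum ^= mat[i];
        return sum;
    }

    // square = mat * mat over GF(2).
    static void gf2MatrixSquare(int[] square, int[] mat) {
        for (int n = 0; n < 32; n++)
            square[n] = gf2MatrixTimes(mat, mat[n]);
    }

    /**
     * Given crc1 = CRC32(A), crc2 = CRC32(B), and len2 = B.length,
     * returns CRC32(A || B) without touching the data again.
     */
    static int combine(int crc1, int crc2, long len2) {
        if (len2 <= 0) return crc1;
        int[] even = new int[32];
        int[] odd  = new int[32];
        odd[0] = 0xEDB88320;                     // reflected zip CRC polynomial
        for (int n = 1; n < 32; n++)
            odd[n] = 1 << (n - 1);               // operator for one zero bit
        gf2MatrixSquare(even, odd);              // two zero bits
        gf2MatrixSquare(odd, even);              // four zero bits
        do {                                     // square to 8, 16, 32, ... bits
            gf2MatrixSquare(even, odd);          // even = operator for next power
            if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(even, crc1);
            len2 >>= 1;
            if (len2 == 0) break;
            gf2MatrixSquare(odd, even);
            if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(odd, crc1);
            len2 >>= 1;
        } while (len2 != 0);
        return crc1 ^ crc2;
    }
}
```

The same machinery works for CRC32C by swapping in its reflected polynomial
(0x82F63B78), since nothing above depends on which P is used.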

The carry-less-multiply inner loop also consumes 64 bits per multiply,
but those instructions are not data-dependent on one another, so it
might go faster.  Haswell chips execute clmul especially quickly.
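For a first cut at the benchmark, something like this throwaway
nanoTime harness (not a proper JMH benchmark; all names are mine)
would give a rough GB/s number for the existing CRC32 intrinsic:

```java
import java.util.Random;
import java.util.zip.CRC32;

public class Crc32Bench {
    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];          // 1 MiB of pseudo-random input
        new Random(42).nextBytes(data);
        CRC32 crc = new CRC32();
        // Warm up so the JIT compiles and the intrinsic kicks in before timing.
        for (int i = 0; i < 100; i++) { crc.reset(); crc.update(data); }
        int reps = 200;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) { crc.reset(); crc.update(data); }
        long ns = System.nanoTime() - start;
        double gbPerSec = (double) data.length * reps / ns; // bytes/ns == GB/s
        System.out.printf("CRC32: %.2f GB/s (checksum %08x)%n",
                          gbPerSec, crc.getValue());
    }
}
```

Running the same loop over a CRC32C implementation would give the
head-to-head comparison.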

The assembly language I wrote to make that go fast is agnostic about P;
it is parameterized by a structure containing the small set of constants
(six 64-bit numbers) that make it all go, based on powers of x (regarded
as a polynomial) modulo the CRC polynomial P.  If the endianness
conventions match, and if the clmul-based zip CRC runs faster than the
CRC32 instruction, I could revive it, and modern tools might let us
avoid coding in hex.
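Those constants are just values of the form x^k mod P, so they are
cheap to derive for any polynomial.  A sketch of the derivation in
plain Java (my own naming; assuming the unreflected zip polynomial
written with its x^32 term as 0x104C11DB7):

```java
public class CrcConstants {
    // Carry-less (GF(2)) multiply of two 32-bit polynomials -> 64-bit product.
    static long clmul(long a, long b) {
        long r = 0;
        for (int i = 0; i < 32; i++)
            if (((b >>> i) & 1) != 0) r ^= a << i;
        return r;
    }

    // Reduce a polynomial of degree < 64 modulo P, where p holds P
    // including its x^32 term (e.g. 0x104C11DB7 for the zip CRC).
    static long mod(long a, long p) {
        for (int i = 63; i >= 32; i--)
            if (((a >>> i) & 1) != 0) a ^= p << (i - 32);
        return a;
    }

    // x^n mod P by square-and-multiply; each result is one table constant.
    static long xPowMod(long n, long p) {
        long result = 1, base = 2;               // the polynomials 1 and x
        while (n != 0) {
            if ((n & 1) != 0) result = mod(clmul(result, base), p);
            base = mod(clmul(base, base), p);
            n >>>= 1;
        }
        return result;
    }
}
```

Feeding in the CRC32C polynomial instead would regenerate the whole
constant table, which is what makes the routine polynomial-agnostic.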

> I have also been thinking about using fork-join to help with large arrays, but I decided not to go after it for this first implementation of the new API. The other question is whether that should be something more explicit, similar to parallel streams, as not everyone might want/expect the API to suddenly start using multiple threads to calculate the result.

That’s why we have a system fork/join pool; if you don’t want lots of concurrency,
limit the number of threads in that pool.  Soon enough people are going to
yell at us for not using concurrency when it is available, and forcing them to
use a different API just to get concurrency will hinder the ability to write portable
code.
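For anyone who does want to rein it in, the common pool’s size can be
capped before its first use; roughly (in practice the -D flag on the
command line is the usual route, and the class name here is mine):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolCap {
    public static void main(String[] args) {
        // Must run before the common pool is first initialized; equivalent to
        //   java -Djava.util.concurrent.ForkJoinPool.common.parallelism=1 ...
        System.setProperty(
            "java.util.concurrent.ForkJoinPool.common.parallelism", "1");
        System.out.println(ForkJoinPool.commonPool().getParallelism());
    }
}
```

With the pool capped this way, a fork/join-based checksum would quietly
degrade to near-sequential execution without needing a separate API.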

David



