RFR JDK-6321472: Add CRC-32C API

Fri Oct 17 21:22:22 UTC 2014

Hi David,

Not sure what you mean with a bit flip. Calculating CRC32 and then 
flipping the result in some way?

The polynomial used for multiplication is different so while, as with 
all CRCs, the algorithm is similar, but a different polynomial is used 
and the final checksum will be different.

The intrinsic that can be implemented for CRC32C will probably be even 
faster as it can make use of the specific crc32c instruction instead of 
using carry less multiplication.

I have also been thinking that using fork-join to help for large arrays. 
I decided to not go after it for this first implementation of the new 
API. The other question is if that should be something that is more 
explicit, similar to parallel streams, as not everyone might want/expect 
that the API to suddenly start using multiple threads to calculate the 
result.

Cheers,
Staffan

On 10/17/2014 02:06 PM, David Chase wrote:
> See also: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/
>
> This is the version that was not coded as an intrinsic, that also included
> fork/join.  The crazy Intel instructions are accessed from native code,
> so you can get a feel for what the code looks like before it is converted
> to an intrinsic.  There’s a certain amount of brain-hurt involved in the
> fork/join code, but it works.
>
> I’m still trying to figure out if the whole thing is just bit-flipped.
>
> David
>
> On 2014-10-17, at 4:50 PM, David Chase <david.r.chase at oracle.com> wrote:
>
>> On 2014-10-17, at 2:53 PM, Staffan Friberg <staffan.friberg at oracle.com> wrote:
>>
>>> Fully agree that using Unsafe makes one sad.
>>>
>>> I'm just about to send out a new webrev with Alan's and Peter's comments, once I have that done I will give using the NIO-Buffer API a second try to see if using IntBuffer and LongBuffer is able to achieve similar performance.
>>>
>>> As I noted in my reply the second goal after adding this API will is to create intrinsics that make use of the crc32c instructions available on both x86 and SPARC which will bump the performance even further. So one thing I try to do is make sure the implementation makes it easy to do that without having to completely rewrite it again.
>> I’d like to review this, but it will take me a little bit of time.
>> Recall that I did a lot of work on CRC for Intel to take advantage
>> of the carryless multiply instructions for CRC32.
>>
>> Reading the comments, if I understand this properly, the difference between
>> CRC32 and CRC32C is that CRC32C is just the bit or byte flip of CRC32 as
>> we currently compute it.   If so, wouldn’t it make more sense to not reinvent
>> that rather tricky wheel?  The code you have carefully written will run substantially
>> slower than CRC32 on recent Intel hardware (Haswell and newer in particular)
>> because there an intrinsic is already substituted.
>>
>> Can you verify whether this bit/byte flipping equivalence holds, or not?
>>
>> If we were interested in true peak performance, we’d also investigate
>> fork/join parallelism; I did this once and it worked just fine if you made the
>> block sizes large enough.
>>
>> David
>>