RFR JDK-6321472: Add CRC-32C API

Wed Oct 22 17:54:56 UTC 2014

You're right, managed to convince myself here as well. Will change it to <=.

An unfortunate side-effect seems to be that in a benchmark with 1024 
long array it seems like the performance drops when the tail loop is not 
utilized.
As you can see below there is a slight drop when the data is perfectly 
aligned in the array causing neither alignment or tail loop to be utilized.
Not going to dig into why right now, but an interesting effect.

     @Benchmark
     @OperationsPerInvocation(1024)
     public void byteArray_1K() {
         checksum.update(unalignedByteArray_1k, offset, 1024);
     }

Benchmark                           (offset)   Mode  Samples        Score  Score error   Units
o.c.CRC32CUnaligned.byteArray_1K           0  thrpt        5   983198.048      654.467  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           1  thrpt        5  1013136.714     3801.537  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           2  thrpt        5  1006899.265    10998.612  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           3  thrpt        5  1014419.366     3745.875  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           4  thrpt        5  1007777.937     1585.989  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           5  thrpt        5  1011249.564      852.333  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           6  thrpt        5  1013622.327     4075.303  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           7  thrpt        5  1008137.616     1672.262  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           8  thrpt        5   985838.295      448.735  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           9  thrpt        5  1005935.356     4250.381  ops/ms
o.c.CRC32CUnaligned.byteArray_1K          10  thrpt        5   995912.149     2447.272  ops/ms

With just using < then the performance is equal across all alignments.

Benchmark                           (offset)   Mode  Samples        Score  Score error   Units
o.c.CRC32CUnaligned.byteArray_1K           0  thrpt        5  1003094.419     2656.518  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           1  thrpt        5  1009641.996     2519.614  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           2  thrpt        5  1009169.464    12987.786  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           3  thrpt        5  1006869.144     1617.988  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           4  thrpt        5  1005490.644     1234.283  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           5  thrpt        5  1007468.090      327.541  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           6  thrpt        5  1014316.244     1174.920  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           7  thrpt        5  1008378.002     1440.734  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           8  thrpt        5  1003253.408    10786.210  ops/ms
o.c.CRC32CUnaligned.byteArray_1K           9  thrpt        5  1000950.362      250.617  ops/ms
o.c.CRC32CUnaligned.byteArray_1K          10  thrpt        5  1003783.897    11333.132  ops/ms

//Staffan

On 10/21/2014 10:56 PM, Peter Levart wrote:

> On 10/21/2014 11:34 PM, Staffan Friberg wrote:
>> I believe it must be <, as it is in the tail loop as well, because 
>> end is (off+len or limit) so end is exclusive, similar to 
>> subString(begin,end).
>>
>> Makes sense?
>>
>> //Staffan
>>
>> On 10/21/2014 01:46 PM, Peter Levart wrote:
>>> Sorry Staffan, another nit...
>>>
>>>   212             for (; off < (end - Long.BYTES); off += Long.BYTES) {
>>> and
>>>
>>>   286             for (; off < (end - Long.BYTES); off += Long.BYTES) {
>>>
>>>
>>> I think you could change "off < (end - Long.BYTES)" to "off <= (end 
>>> - Long.BYTES)". Am I right?
>
> The tail loop has < :
>
>  319         for (; off < end; off++) {
>
>
> ...but it could be written as:
>
>
> for (; off <= (end - Byte.BYTES); off += Byte.BYTES) { ...
>
>
> ;-)
>
> In other words, when off == end - Long.BYTES, you can still read 
> Long.BYTES starting at 'off' .
>
> Regards, Peter
>
>
>>>
>>> Regards, Peter
>>>
>>>
>>> On 10/21/2014 10:30 PM, Peter Levart wrote:
>>>>
>>>> On 10/21/2014 08:49 PM, Staffan Friberg wrote:
>>>>> Hi,
>>>>>
>>>>> Got an offline comment that the package.html should be update as 
>>>>> well to cover CRC-32C.
>>>>>
>>>>> Otherwise there are no code changes in this new webrev.
>>>>>
>>>>> http://cr.openjdk.java.net/~sfriberg/JDK-6321472/webrev.04
>>>>
>>>>
>>>> Hi Staffan,
>>>>
>>>>   198         if (end - off >= 8 && 
>>>> Unsafe.ARRAY_BOOLEAN_INDEX_SCALE == 1) {
>>>>
>>>> ARRAY_BOOLEAN_INDEX_SCALE -> ARRAY_BYTE_INDEX_SCALE
>>>>
>>>>
>>>> Otherwise looks good now.
>>>>
>>>> Regards, Peter
>>>>
>>>> P.S.
>>>>
>>>> I think (by looking at DirectByteBuffer.asIntBuffer() 
>>>> implementation) that when doing 32 bit (4 byte) reads using Unsafe, 
>>>> the address only has to be aligned to 4 bytes (8 is necessary 
>>>> alignment for 64 bit reads). So updateDirectByteBuffer could make 
>>>> this alignment on 4 bytes as it's only using 32 bit reads (with 
>>>> additional check on ADDRESS_SIZE, you could do that for updateBytes 
>>>> too).
>>>>
>>>> You don't get much out of it, so you decide if it's worth 
>>>> complication.
>>>>
>>>>
>>>>>
>>>>> //Staffan
>>>>>
>>>>> On 10/21/2014 10:28 AM, Staffan Friberg wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> Thanks for the comments..
>>>>>>>
>>>>>>>   217                 if (Unsafe.ADDRESS_SIZE == 4) {
>>>>>>>   218                     // On 32 bit platforms read two ints 
>>>>>>> instead of a single 64bit long
>>>>>>>
>>>>>>> When you're reading from byte[] using Unsafe (updateBytes), you 
>>>>>>> have the option of reading 64bit values on 64bit platforms. When 
>>>>>>> you're reading from DirectByteBuffer memory 
>>>>>>> (updateDirectByteBuffer), you're only using 32bit reads.
>>>>>> I will add a comment in the code for this decision. The reason is 
>>>>>> that read a long results in slightly worse performance in this 
>>>>>> case, in updateBytes it is faster. I was able to get it to run 
>>>>>> slightly faster by working directly with the address instead of 
>>>>>> always adding address + off, but this makes things worse in the 
>>>>>> 32bit case since all calculation will now be using long 
>>>>>> variables. So using the getInt as in the current code feels like 
>>>>>> the best solution as it strikes the best balance between 32 and 
>>>>>> 64bit. Below is how updateByteBuffer looked with the rewrite I 
>>>>>> mentioned.
>>>>>>
>>>>>>
>>>>>>  ong address = ((DirectBuffer) buffer).address();
>>>>>>  crc = updateDirectByteBuffer(crc, address + pos, address + limit);
>>>>>>
>>>>>>
>>>>>>      private static int updateDirectByteBuffer(int crc, long adr, 
>>>>>> long end) {
>>>>>>
>>>>>>         // Do only byte reads for arrays so short they can't be 
>>>>>> aligned
>>>>>>         if (end - adr >= 8) {
>>>>>>
>>>>>>             // align on 8 bytes
>>>>>>             int alignLength = (8 - (int) (adr & 0x7)) & 0x7;
>>>>>>             for (long alignEnd = adr + alignLength; adr < 
>>>>>> alignEnd; adr++) {
>>>>>>                 crc = (crc >>> 8)
>>>>>>                         ^ byteTable[(crc ^ UNSAFE.getByte(adr)) & 
>>>>>> 0xFF];
>>>>>>             }
>>>>>>
>>>>>>             if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
>>>>>>                 crc = Integer.reverseBytes(crc);
>>>>>>             }
>>>>>>
>>>>>>             // slicing-by-8
>>>>>>             for (; adr < (end - Long.BYTES); adr += Long.BYTES) {
>>>>>>                 int firstHalf;
>>>>>>                 int secondHalf;
>>>>>>                 if (Unsafe.ADDRESS_SIZE == 4) {
>>>>>>                     // On 32 bit platforms read two ints instead 
>>>>>> of a single 64bit long
>>>>>>                     firstHalf = UNSAFE.getInt(adr);
>>>>>>                     secondHalf = UNSAFE.getInt(adr + Integer.BYTES);
>>>>>>                 } else {
>>>>>>                     long value = UNSAFE.getLong(adr);
>>>>>>                     if (ByteOrder.nativeOrder() == 
>>>>>> ByteOrder.LITTLE_ENDIAN) {
>>>>>>                         firstHalf = (int) value;
>>>>>>                         secondHalf = (int) (value >>> 32);
>>>>>>                     } else { // ByteOrder.BIG_ENDIAN
>>>>>>                         firstHalf = (int) (value >>> 32);
>>>>>>                         secondHalf = (int) value;
>>>>>>                     }
>>>>>>                 }
>>>>>>                 crc ^= firstHalf;
>>>>>>                 if (ByteOrder.nativeOrder() == 
>>>>>> ByteOrder.LITTLE_ENDIAN) {
>>>>>>                     crc = byteTable7[crc & 0xFF]
>>>>>>                             ^ byteTable6[(crc >>> 8) & 0xFF]
>>>>>>                             ^ byteTable5[(crc >>> 16) & 0xFF]
>>>>>>                             ^ byteTable4[crc >>> 24]
>>>>>>                             ^ byteTable3[secondHalf & 0xFF]
>>>>>>                             ^ byteTable2[(secondHalf >>> 8) & 0xFF]
>>>>>>                             ^ byteTable1[(secondHalf >>> 16) & 0xFF]
>>>>>>                             ^ byteTable0[secondHalf >>> 24];
>>>>>>                 } else { // ByteOrder.BIG_ENDIAN
>>>>>>                     crc = byteTable0[secondHalf & 0xFF]
>>>>>>                             ^ byteTable1[(secondHalf >>> 8) & 0xFF]
>>>>>>                             ^ byteTable2[(secondHalf >>> 16) & 0xFF]
>>>>>>                             ^ byteTable3[secondHalf >>> 24]
>>>>>>                             ^ byteTable4[crc & 0xFF]
>>>>>>                             ^ byteTable5[(crc >>> 8) & 0xFF]
>>>>>>                             ^ byteTable6[(crc >>> 16) & 0xFF]
>>>>>>                             ^ byteTable7[crc >>> 24];
>>>>>>                 }
>>>>>>             }
>>>>>>
>>>>>>             if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
>>>>>>                 crc = Integer.reverseBytes(crc);
>>>>>>             }
>>>>>>         }
>>>>>>
>>>>>>         // Tail
>>>>>>         for (; adr < end; adr++) {
>>>>>>             crc = (crc >>> 8)
>>>>>>                     ^ byteTable[(crc ^ UNSAFE.getByte(adr)) & 0xFF];
>>>>>>         }
>>>>>>
>>>>>>         return crc;
>>>>>>     }
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Also, in updateBytes, the usage of 
>>>>>>> Unsafe.ARRAY_INT_INDEX_SCALE/ARRAY_LONG_INDEX_SCALE to index a 
>>>>>>> byte array sounds a little scary. To be ultra portable you could 
>>>>>>> check that ARRAY_BYTE_INDEX_SCALE == 1 first and refuse to use 
>>>>>>> Unsafe for byte arrays if it is not 1. Then use 
>>>>>>> Integer.BYTES/Long.BYTES to manipulate 'offsets' instead. In 
>>>>>>> updateDirectByteBuffer it would be more appropriate to use 
>>>>>>> Integer.BYTES/Long.BYTES too.
>>>>>> Good idea. Added a check in the initial if statement and it will 
>>>>>> get automatically optimized away.
>>>>>>
>>>>>>> 225 firstHalf = (int) (value & 0xFFFFFFFF);
>>>>>>>   226                         secondHalf = (int) (value >>> 32);
>>>>>>>   227                     } else { // ByteOrder.BIG_ENDIAN
>>>>>>>   228                         firstHalf = (int) (value >>> 32);
>>>>>>>   229                         secondHalf = (int) (value & 
>>>>>>> 0xFFFFFFFF);
>>>>>>>
>>>>>>> firstHalf = (int) value; // this is equivalent for line 225
>>>>>>> secondHalf = (int) value; // this is equivalent for line 229
>>>>>> Done.
>>>>>>
>>>>>> Here is the latest webrev, 
>>>>>> http://cr.openjdk.java.net/~sfriberg/JDK-6321472/webrev.03
>>>>>>
>>>>>> Cheers,
>>>>>> Staffan
>>>>>
>>>>
>>>
>>
>>
>