RFR (M): 8134553: CRC32C implementations for Nehalem x86/amd64 & Westmere+ x86/amd64

Tue Sep 8 18:16:47 UTC 2015

Updated review, including all of the changes you have pointed out in an original e-mail:
http://cr.openjdk.java.net/~mcberg/8134553/webrev.02/

--
Thank you,

Tomek

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] 
Sent: Wednesday, September 02, 2015 7:06 PM
To: Wojtowicz, Tomasz; 'hotspot-compiler-dev at openjdk.java.net'
Subject: Re: RFR (M): 8134553: CRC32C implementations for Nehalem x86/amd64 & Westmere+ x86/amd64

Nice work Tomek.
Few things.

Why there are no C1 changes?
TW>We were interested in speeding up an intrinsic path. I believe it is a similar approach to what ARM and SPARC choose.

templateInterpreter_x86_64.cpp has strange symbols in comments:
Java� Virtual
type�long�and�double�have
TW>I have spotted it earlier - looks like I have uploaded an earlier revision of a file than I originally thought . Fixed.

Code style comments:
If possible can you not use namespace? You can have named enums and 
local static methods instead.
TW>I got rid of all namespace CRC32C. Created named enum for chunk sizes.

And macroAssembler_x86.cpp totally goes against our code style.
I would suggest to move IPL_Alg4 and other local methods into 
MacroAssembler class - 'This->' looks strange here.
I would also rename all these local methods to have prefix crc32c_.
TW>Done.

Also align parameters lines to (:

+  void IPL_Alg4(INOUT(Register B), uint32_t n,
+                Scratch(Register C), Scratch(Register D), 
Scratch(Register Z),
+                MacroAssembler * This) {

+  CRC32C::ProcChunk(CRC32C::HIGH, CONSTOrPreCompConstIndex[0], 
CONSTOrPreCompConstIndex[1], IsPclmulqdqSupported,
+                   len, buf, crc,
+                   A, B, C,
+                   AXMM, BXMM, CXMM,
+                   D, E, F,
+                   this);
TW>Done.

Instead of Scratch(Register C) use Register tmp1.
I am fine with IN, INOUT, OUT, Scratch macros but Nehalem() and 
Westmere() is strange to see here - I do not understand their meaning.
TW>Nehalem and Westmere need different resources (Registers) to obtain carry less multiplication results. In order to not generate separate wrappers I thought about using single functions and then just marked registers which are specific to particular architecture with Westmere(), Nehalem() empty macros. In an effort to aid in understanding of a code. But I could get rid of them easily. Already done.
I got rid of a IN/INOUT/OUT macros and modified Register names accordingly in/in_out/out

rename CONSTOrPreCompConstInde* to use low case and _.
TW>Done.

Argument name: IsPclmulqdqSupported --> is_pclmulqdq_supported
TW>Done

stubRoutines_x86.cpp upper case name are used mostly for constants and 
enums. You have array PCLMULQDQ[] (I would change PCLMULQDQ name too - 
pclmulqdq_table) and next weird definition:
+  #undef CONST
+  static juint x;
+  #define CONST x
+      CONST = PowN[j];
Keep it local as X_CONST if you want to highlight that it is constant 
for following expression. Simple 'CONST' name is confusing.
>Both done

Rename  crc32c_IPL_Alg2Alt2Fast --> crc32c_IPL_alg2 (to you need alt fast?).
>I just need Alt2 since Gueron's Alg2 has two alternatives. Done.

Rename GenerateCRC32CTable --> generate_CRC32C_table
TW>Done

crc32c.h:
  // a) constants table generation 
(C:\Java\jdk9hs-comp\hotspot\src\cpu\x86\vm\stubRoutines_x86.cpp)
   62
Use hotspot/src/cpu/x86/vm/stubRoutines_x86.cpp
TW>Done

Thanks,
Vladimir

On 8/27/15 11:47 AM, Wojtowicz, Tomasz wrote:
> I would like to contribute following change:
>
> Review details
>
> Review Title:      CRC32C implementations for Nehalem x86/amd64 &
> Westmere+ x86/amd64
>
> Review ID:          #8134553
>
> Diff: http://cr.openjdk.java.net/~mcberg/8134553/webrev.01/
> <http://cr.openjdk.java.net/%7Emcberg/8134553/webrev.01/>
>
> Description:        Efficient use of a crc32 hardware instruction by
> division of a problem to a predefined chunks of an increasing size and
> further by 3 to be computed hiding instruction latencies. x86 delivers
> up to 8x improvement vs. java library, amd64 stops at even more -> 16x.
>
>                                  Performance data are attached to this
> message for your convenience.
>
>                                  No regressions has been observed on
> hotspot/compiler x86_64.
>
> Link: https://bugs.openjdk.java.net/browse/JDK-8134553
>
> Author:                Tomasz, Wojtowicz
>
> --
>
> Thank you,
>
> Tomek
>