RFR (S): JDK-8191328: Avoid unnecessary overhead in CRC32C

Tue Nov 28 15:35:47 UTC 2017

The initial version of the patch made the C1 code worse because of the
additionally introduced locals; this may be important for client
(arm32). I fixed this by simply grouping the XORs with parentheses. I
also took measurements with Graal and AOT. Note that in the tiered case
with an AOT-compiled java.base, the intrinsic is used if present.
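
For readers who do not want to open the webrev, here is a rough sketch of
the two ideas discussed in this thread: reading ByteOrder.nativeOrder()
once outside the loop, and grouping the XORs with parentheses instead of
introducing extra locals. It is not the actual patch. The class name, the
method names and the tables T0..T3 are hypothetical, and the tables hold
arbitrary values rather than the real CRC32C tables.

```java
import java.nio.ByteOrder;

final class XorGroupingSketch {
    // Hypothetical lookup tables with arbitrary contents (NOT the CRC32C tables).
    static final int[] T0 = new int[256];
    static final int[] T1 = new int[256];
    static final int[] T2 = new int[256];
    static final int[] T3 = new int[256];
    static {
        for (int i = 0; i < 256; i++) {
            T0[i] = i * 0x01010101;
            T1[i] = i * 0x9E3779B1;
            T2[i] = Integer.rotateLeft(i, 13) ^ 0x5A5A5A5A;
            T3[i] = ~i;
        }
    }

    // "Before" shape: the byte order is queried on every iteration, which
    // becomes a real call whenever ByteOrder.nativeOrder() is not inlined.
    static int before(int[] words, int acc) {
        for (int w : words) {
            acc ^= w;
            if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) {
                acc = T3[acc & 0xFF] ^ T2[(acc >>> 8) & 0xFF]
                    ^ T1[(acc >>> 16) & 0xFF] ^ T0[acc >>> 24];
            } else {
                acc = T0[acc & 0xFF] ^ T1[(acc >>> 8) & 0xFF]
                    ^ T2[(acc >>> 16) & 0xFF] ^ T3[acc >>> 24];
            }
        }
        return acc;
    }

    // "After" shape: the byte order is read once, and the XORs are paired
    // with parentheses. XOR is associative, so the result is unchanged.
    static int after(int[] words, int acc) {
        final boolean little = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        for (int w : words) {
            acc ^= w;
            if (little) {
                acc = (T3[acc & 0xFF] ^ T2[(acc >>> 8) & 0xFF])
                    ^ (T1[(acc >>> 16) & 0xFF] ^ T0[acc >>> 24]);
            } else {
                acc = (T0[acc & 0xFF] ^ T1[(acc >>> 8) & 0xFF])
                    ^ (T2[(acc >>> 16) & 0xFF] ^ T3[acc >>> 24]);
            }
        }
        return acc;
    }
}
```

Both methods return the same value for the same input; only the shape of
the code the JIT or interpreter has to deal with differs.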

Updated webrev: http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/
Updated benchmark: 
http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/CRC32CAltBench.java

Results on my x86 laptop and JDK 10:

Tiered
before  375 ± 6  ns/op
after   334 ± 3  ns/op 11%

Tiered with Graal (JVMCI)
before  356 ± 7  ns/op
after   327 ± 6  ns/op 8%

Tiered with AOT compiled benchmark (non-tiered)
before  1308 ± 58  ns/op
after   1010 ±  8  ns/op 1.3x

Tiered with -XX:MaxInlineLevel=0
before  660 ± 4  ns/op
after   338 ± 3  ns/op 1.9x

C1
before  498 ± 4  ns/op
after   495 ± 4  ns/op same

Interpreter
before  40844 ± 333  ns/op
after   24777 ± 624  ns/op 1.7x

-Dmitry

On 11/16/2017 07:42 PM, Dmitry Chuyko wrote:
> On 11/15/2017 09:44 PM, Andrew Haley wrote:
>> On 15/11/17 18:38, Vitaly Davidovich wrote:
>>> On Wed, Nov 15, 2017 at 12:40 PM, Andrew Haley <aph at redhat.com> wrote:
>>>> On 15/11/17 15:38, Alan Bateman wrote:
>>>>> Moving the nativeOrder out of the loop makes sense but I'm curious
>>>>> about the context for improving this implementation.
>>>> I wonder about lifting ByteOrder.nativeOrder().  Maybe it fails to
>>>> inline because the method is too large: if that happens, we really
>>>> lose.  I'm not seeing that, though: it seems to be inlined just fine,
>>>> and has no effect.
> Sure, it is the effect of missing inlining. But inlining is relatively
> easy to break with your tiered JIT settings (not sure about AOT). For
> example, in HotSpot:
> -XX:-Inline, -XX:MaxInlineLevel=0 (it would be no surprise to meet this
> one in the wild), -XX:FreqInlineSize=3, -XX:InlineSmallCode=15..
>>>>
>>>> In any case, this patch doesn't help anything on my test hardware.
>>> Is this with -Xcomp though? That can generate crap code because
>>> there's no profiling information.  Not that -Xcomp should be the way
>>> to test peak performance IMO, but that is the setting that was used I
>>> believe.
> Another noticeable case is -Xint, where the absolute times of CRC
> calculation are quite long.
>
> Here is a benchmark that is easier to experiment with (no need to 
> build jdk or to turn off intrinsics):
>
> http://cr.openjdk.java.net/~dchuyko/8191328/CRC32CAltBench.java
>
> Some x86 results:
>
> default tiered
> before  380.957 ± 11.621  ns/op
> after   350.838 ±  5.149  ns/op
>
> -XX:MaxInlineLevel=0
> before  656.791 ± 8.216  ns/op
> after   340.999 ± 2.686  ns/op
>
> -Xint
> before  36113.441 ± 197.716  ns/op
> after   26928.593 ± 133.309  ns/op
>
> -Dmitry
>
>> Shrug; maybe.  We shouldn't mess the code up for -Xcomp.
>>
>