RFR (S): JDK-8191328: Avoid unnecessary overhead in CRC32C
Dmitry Chuyko
dmitry.chuyko at bell-sw.com
Tue Nov 28 15:35:47 UTC 2017
The initial version of the patch produced worse C1 code because of the
additional locals it introduced; this may matter for client VMs (arm32).
I fixed this by simply grouping the XORs with parentheses. I also took
measurements with Graal and AOT. Note that in the case of tiered with
AOT-compiled java.base, the intrinsic is used if present.
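
To illustrate what grouping the XORs means here, below is a minimal sketch,
not the code from the webrev. The class name, method names and table
contents are made up for the example; the real CRC32C lookup tables are not
reproduced. It only shows that the grouping can be expressed either through
extra locals or through parentheses in a single expression, with the same
result.

```java
final class XorGrouping {
    // Hypothetical stand-in tables, filled with arbitrary values.
    static final int[] T0 = new int[256];
    static final int[] T1 = new int[256];
    static final int[] T2 = new int[256];
    static final int[] T3 = new int[256];
    static {
        for (int i = 0; i < 256; i++) {
            T0[i] = i * 0x01010101;
            T1[i] = i * 0x9E3779B1;
            T2[i] = ~i;
            T3[i] = Integer.rotateLeft(i, 13);
        }
    }

    // Variant with extra locals: each partial XOR is stored in its own local.
    static int withLocals(int crc) {
        int lo = T3[crc & 0xFF] ^ T2[(crc >>> 8) & 0xFF];
        int hi = T1[(crc >>> 16) & 0xFF] ^ T0[crc >>> 24];
        return lo ^ hi;
    }

    // Variant with the same grouping expressed by parentheses only.
    static int withParens(int crc) {
        return (T3[crc & 0xFF] ^ T2[(crc >>> 8) & 0xFF])
             ^ (T1[(crc >>> 16) & 0xFF] ^ T0[crc >>> 24]);
    }
}
```

Both variants compute the same value; whether one compiles to better code
than the other depends on the compiler and has to be measured.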
Updated webrev: http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/
Updated benchmark:
http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/CRC32CAltBench.java
Results on my x86 laptop with JDK 10:

Tiered
  before      375 ± 6   ns/op
  after       334 ± 3   ns/op    11%

Tiered with Graal (JVMCI)
  before      356 ± 7   ns/op
  after       327 ± 6   ns/op    8%

Tiered with AOT compiled benchmark (non-tiered)
  before     1308 ± 58  ns/op
  after      1010 ± 8   ns/op    1.3x

Tiered with -XX:MaxInlineLevel=0
  before      660 ± 4   ns/op
  after       338 ± 3   ns/op    1.9x

C1
  before      498 ± 4   ns/op
  after       495 ± 4   ns/op    same

Interpreter
  before    40844 ± 333 ns/op
  after     24777 ± 624 ns/op    1.7x
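
For readers without the webrev at hand, the hoisting of
ByteOrder.nativeOrder() discussed in the quoted thread below can be
sketched roughly as follows. This is a simplified illustration under my own
assumptions, not the patch itself: the class, the method names and the loop
body are hypothetical, and only the placement of the byte-order check
matters.

```java
import java.nio.ByteOrder;

final class NativeOrderHoist {
    // Before: the byte-order check is evaluated on every iteration.
    // If nativeOrder() is not inlined, this is a call per iteration.
    static int foldInLoop(int[] words) {
        int acc = 0;
        for (int w : words) {
            if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) {
                acc ^= w;
            } else {
                acc ^= Integer.reverseBytes(w);
            }
        }
        return acc;
    }

    // After: the check is done once and kept in a local.
    static int foldHoisted(int[] words) {
        final boolean little = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        int acc = 0;
        for (int w : words) {
            acc ^= little ? w : Integer.reverseBytes(w);
        }
        return acc;
    }
}
```

With inlining working, a JIT can be expected to treat both forms alike; the
second form simply does not rely on that.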
-Dmitry
On 11/16/2017 07:42 PM, Dmitry Chuyko wrote:
> On 11/15/2017 09:44 PM, Andrew Haley wrote:
>> On 15/11/17 18:38, Vitaly Davidovich wrote:
>>> On Wed, Nov 15, 2017 at 12:40 PM, Andrew Haley <aph at redhat.com> wrote:
>>>> On 15/11/17 15:38, Alan Bateman wrote:
>>>>> Moving the nativeOrder out of the loop makes sense but I'm curious
>>>>> about the context for improving this implementation.
>>>> I wonder about lifting ByteOrder.nativeOrder(). Maybe it fails to
>>>> inline because the method is too large: if that happens, we really
>>>> lose. I'm not seeing that, though: it seems to be inlined just fine,
>>>> and has no effect.
> Sure, it is the effect of missing inlining. But inlining is relatively
> easy to break with your tiered JIT settings (not sure about AOT). For
> example, in HotSpot:
> -XX:-Inline, -XX:MaxInlineLevel=0 (it would be no surprise to meet this
> one in the wild), -XX:FreqInlineSize=3, -XX:InlineSmallCode=15..
>>>>
>>>> In any case, this patch doesn't help anything on my test hardware.
>>> Is this with -Xcomp though? That can generate crap code because
>>> there's no profiling information. Not that -Xcomp should be the way
>>> to test peak performance IMO, but that is the setting that was used I
>>> believe.
> Another noticeable case is -Xint, where the absolute times of the CRC
> calculation are quite long.
>
> Here is a benchmark that is easier to experiment with (no need to
> build jdk or to turn off intrinsics):
>
> http://cr.openjdk.java.net/~dchuyko/8191328/CRC32CAltBench.java
>
> Some x86 results:
>
> default tiered
> before 380.957 ± 11.621 ns/op
> after 350.838 ± 5.149 ns/op
>
> -XX:MaxInlineLevel=0
> before 656.791 ± 8.216 ns/op
> after 340.999 ± 2.686 ns/op
>
> -Xint
> before 36113.441 ± 197.716 ns/op
> after 26928.593 ± 133.309 ns/op
>
> -Dmitry
>
>> Shrug; maybe. We shouldn't mess the code up for -Xcomp.
>>
>