AW: ridiculous ParNew pause times

Wed Apr 9 16:40:10 UTC 2014

Bingo!
First, i tried to get a tool which shows me which process writes to which file and how long that takes, but I was unable to find one I could master to use.
Based on your suggestion I moved the log.gc file to a ramdisk and performed extensive load testing - now my biggest outlier is 
2014-04-09T18:29:13.873+0200: 383.372: [GC2014-04-09T18:29:13.874+0200: 383.372: [ParNew: 431041K->208698K(3145728K), 1.8312460 secs] 11055599K->11203421K(21495808K), 1.8315130 secs] [Times: user=2.61 sys=0.03, real=1.83 secs]
which is okay.
When moving log.gc back to the harddisk to which redis saves I get pauses as long as 45sec.

So I agree with Vitaly that redis probably thrashes some caches (though not all, as it only occupies one cpu core of a dual-socket server) which would explain a slowdown of garbage collections, but the real cause of the long spikes is actually writing the log to disk.

Could you guys check if you write the gc log synchronous?
If yes, I'd suggest (a) moving writing the log out of the STW phase and (b) using asynchronous I/O for writing the log.

Thanks for all your suggestions and your help!

Regards,
Cornelius Riemenschneider
--
ITscope GmbH
Ludwig-Erhard-Allee 20
76131 Karlsruhe
Email: cornelius.riemenschneider at itscope.de
https://www.itscope.com
Handelsregister: AG Mannheim, HRB 232782
Sitz der Gesellschaft: Karlsruhe
Geschäftsführer: Alexander Münkel, Benjamin Mund, Stefan Reger

-----Ursprüngliche Nachricht-----
Von: Wolfgang Pedot [mailto:wolfgang.pedot at finkzeit.at] 
Gesendet: Mittwoch, 9. April 2014 17:29
An: Vitaly Davidovich
Cc: Cornelius Riemenschneider; hotspot-gc-use
Betreff: Re: ridiculous ParNew pause times

A crazy idea: could the gc-logging be the problem here because the redis-dump is using up all the IO bandwidth?

Wolfgang

Am 09.04.2014 17:06, schrieb Vitaly Davidovich:
> Interesting.  How much RAM does this machine have? You're using about 
> 21-22g for jvm and another 5g for redis, right? Is there room to spare 
> in physical memory?
>
> If there's no swap activity, then I don't think there's any I/O 
> interference.  I'm guessing that redis writing out 5g of memory is 
> blowing the cpu caches for the jvm, and causing page faults.  Although 
> if there's no swap activity, it seems these faults would be soft and 
> it's hard to explain how that would balloon to 30 sec pauses and not 
> show up in sys time then.
>
> Are there lots of context switches in the java process when this happens?
>
> Sent from my phone
>
> On Apr 9, 2014 10:41 AM, "Cornelius Riemenschneider" <cri at itscope.de 
> <mailto:cri at itscope.de>> wrote:
>
>     Hey,____
>
>     thanks for the hints :-)____
>
>     The server runs one JVM and one redis instance (A event-based,
>     single threaded in-memory nosql-datastore).____
>
>     Redis stores about 5G of data, which are written to disk from time
>     to time ? it now turns out, that the redis saves align perfectly
>     with our long ParNew times.____
>
>     By initiating a redis save I was able to get these garbage
>     collections.____
>
>     2014-04-09T15:06:07.892+0200: 85043.119:
>     [GC2014-04-09T15:06:18.089+0200: 85053.315: [ParNew:
>     2964426K->338263K(3145728K), 0.1532950 secs]
>     7167668K->4541505K(21495808K), 10.3502410 secs] [Times: user=1.93
>     sys=0.01, real=10.35 secs]____
>
>     2014-04-09T15:06:39.203+0200: 85074.429:
>     [GC2014-04-09T15:07:11.026+0200: 85106.252: [ParNew:
>     3145728K->405516K(3145728K), 0.3429470 secs]
>     7645283K->5212053K(21495808K), 32.1663210 secs] [Times: user=2.96
>     sys=0.01, real=32.17 secs]____
>
>     I monitored the system during the GCs and swapping definitly does
>     not happen.____
>
>     Because redis is single-threaded (and only used by the JVM which is
>     during ParNew in a STW phase), during ParNew there is only redis
>     active, trying to write 5G of memory to disk as fast as possible and
>     the GC.____
>
>     I wasn?t able to pinpoint the issue yet, do you have an idea why the
>     jvm could block on I/O during a GC?____
>
>     Disk access is during these phases probably way slower than usual,
>     as everything is on the / partition, a RAID 1 disk array.____
>
>     Or does anyone know the right perf options for  the perf tool to
>     monitor blocking i/o during garbage collections?____
>
>     __ __
>
>     Regards,____
>
>     Cornelius Riemenschneider____
>
>     --____
>
>     ITscope GmbH____
>
>     Ludwig-Erhard-Allee 20____
>
>     76131 Karlsruhe____
>
>     Email: cornelius.riemenschneider at itscope.de
>     <mailto:cornelius.riemenschneider at itscope.de>____
>
>     https://www.itscope.com____
>
>     Handelsregister: AG Mannheim, HRB 232782____
>
>     Sitz der Gesellschaft: Karlsruhe____
>
>     Geschäftsführer: Alexander Münkel, Benjamin Mund, Stefan Reger____
>
>     __ __
>
>     *Von:*Vitaly Davidovich [mailto:vitalyd at gmail.com
>     <mailto:vitalyd at gmail.com>]
>     *Gesendet:* Mittwoch, 9. April 2014 14:49
>     *An:* Cornelius Riemenschneider
>     *Cc:* hotspot-gc-use
>     *Betreff:* Re: ridiculous ParNew pause times____
>
>     __ __
>
>     This is typically caused by one of:____
>
>     1) heavy swapping (this won't count towards sys time because kernel
>     is not using cpu while waiting for I/O to complete)
>     2) oversubscribed machine where gc threads aren't getting enough cpu
>     time due____
>
>     Have you looked at stats on the machine when these pauses occur,
>     specifically around swap activity? Is your machine running multiple
>     JVMs or any other noisy neighbor apps?____
>
>     Sent from my phone____
>
>     On Apr 9, 2014 7:26 AM, "Cornelius Riemenschneider" <cri at itscope.de
>     <mailto:cri at itscope.de>> wrote:____
>
>     Hello,____
>
>     we?ve got the following problem with the ParNew collector:____
>
>     Our log.gc usually looks like this:____
>
>     2014-04-09T12:58:02.485+0200: 77357.712:
>     [GC2014-04-09T12:58:02.485+0200: 77357.712: [ParNew:
>     2722925K->100795K(3145728K), 0.3202010 secs]
>     6998057K->4375934K(21495808K), 0.3205670 secs] [Times: user=4.10
>     sys=0.02, real=0.32 secs] ____
>
>     2014-04-09T12:58:06.256+0200: 77361.483:
>     [GC2014-04-09T12:58:06.257+0200: 77361.483: [ParNew:
>     2722235K->101011K(3145728K), 0.3229910 secs]
>     6997374K->4376165K(21495808K), 0.3233580 secs] [Times: user=4.13
>     sys=0.02, real=0.33 secs] ____
>
>     2014-04-09T12:58:12.295+0200: 77367.522:
>     [GC2014-04-09T12:58:12.296+0200: 77367.522: [ParNew:
>     2722451K->101057K(3145728K), 0.3215320 secs]
>     6997605K->4376216K(21495808K), 0.3219080 secs] [Times: user=4.12
>     sys=0.01, real=0.32 secs] ____
>
>     2014-04-09T12:58:18.461+0200: 77373.688:
>     [GC2014-04-09T12:58:18.462+0200: 77373.688: [ParNew:
>     2722497K->2232K(3145728K), 0.2944540 secs]
>     6997656K->4376242K(21495808K), 0.2948280 secs] [Times: user=3.79
>     sys=0.00, real=0.29 secs]____
>
>     But occasionally we have entries like these:____
>
>     2014-04-09T09:56:12.840+0200: 66448.066:
>     [GC2014-04-09T09:56:38.154+0200: 66473.381: [ParNew:
>     3139808K->524288K(3145728K), 0.8355990 secs]
>     6845512K->4585563K(21495808K), 26.1502640 secs] [Times: user=5.59
>     sys=0.16, real=26.16 secs] ____
>
>     2014-04-09T09:57:09.173+0200: 66504.400:
>     [GC2014-04-09T09:57:24.871+0200: 66520.098: [ParNew:
>     2950573K->488555K(3145728K), 0.1876660 secs]
>     8701082K->6239064K(21495808K), 15.8858250 secs] [Times: user=2.38
>     sys=0.00, real=15.89 secs] ____
>
>     2014-04-09T12:58:34.661+0200: 77389.888:
>     [GC2014-04-09T12:59:06.390+0200: 77421.616: [ParNew:
>     2623439K->2083K(3145728K), 0.0292270 secs]
>     6997709K->4376490K(21495808K), 31.7578950 secs] [Times: user=0.34
>     sys=0.02, real=31.76 secs] ____
>
>     which I can?t explain at all.____
>
>     The real time of 31.76sec equals a pause of 31.76secs, in which the
>     jvm does not respond to user requests, which is obviously bad.____
>
>     The application is _/very/_ allocation heavy, so generally pauses of
>     0.3sec are okay.____
>
>     Our GC settings for this server are:____
>
>     -Xms21g____
>
>     -Xmx21g____
>
>     -XX:ReservedCodeCacheSize=256m____
>
>     -XX:PermSize=256m____
>
>     -XX:MaxPermSize=768m____
>
>     -server____
>
>     -verbose:gc____
>
>     -Xloggc:log.gc____
>
>     -XX:+PrintGCDetails____
>
>     -XX:+PrintGCDateStamps____
>
>     -XX:+ExplicitGCInvokesConcurrent____
>
>     -XX:NewRatio=5____
>
>     -XX:SurvivorRatio=5____
>
>     -XX:+UseConcMarkSweepGC____
>
>     -XX:+UseParNewGC____
>
>     -XX:+UseCMSInitiatingOccupancyOnly____
>
>     -XX:CMSInitiatingOccupancyFraction=40____
>
>     -XX:+CMSClassUnloadingEnabled____
>
>     -XX:+CMSScavengeBeforeRemark____
>
>     -Dsun.rmi.dgc.client.gcInterval=1209600000____
>
>     -Dsun.rmi.dgc.server.gcInterval=1209600000____
>
>     ____
>
>     We run the sun jdk 7u51 on a current debian wheezy.____
>
>     We previously had issues with long ParNew pauses, but back then, the
>     sys time was high, so we concluded that the server was 
> swapping,____
>
>     which we were able to prevent.____
>
>     Do you have any idea or further hint at debugging options which
>     might help us in finding the issue?____
>
>     ____
>
>     Regards,____
>
>     Cornelius Riemenschneider____
>
>     --____
>
>     ITscope GmbH____
>
>     Ludwig-Erhard-Allee 20____
>
>     76131 Karlsruhe____
>
>     Email: cornelius.riemenschneider at itscope.de
>     <mailto:cornelius.riemenschneider at itscope.de>____
>
>     https://www.itscope.com____
>
>     Handelsregister: AG Mannheim, HRB 232782____
>
>     Sitz der Gesellschaft: Karlsruhe____
>
>     Geschäftsführer: Alexander Münkel, Benjamin Mund, Stefan Reger____
>
>     ____
>
>
>     _______________________________________________
>     hotspot-gc-use mailing list
>     hotspot-gc-use at openjdk.java.net <mailto:hotspot-gc-use at openjdk.java.net>
>     http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use____
>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>