Spikes in duration of the minor GC caused by some unknown JVM activity before GC
Vitaly Davidovich
vitalyd at gmail.com
Tue Dec 3 12:06:58 PST 2013
Well, I asked because of this line:
565117K->5057K(629120K), 0.0629070 secs] 1642325K->1082393K(2027264K), 2.0523080 secs] [Times: user=0.07 sys=0.00, real=2.05 secs]
user=0.07 against real=2.05 means the GC got only about 70 ms of CPU across all of its worker threads during a 2-second wall-clock pause. Typically, when user time is completely out of whack with real time (and sys time is low), you have (1) GC threads not getting enough CPU time because other things are runnable, (2) severe stalls incurred by GC threads due to something like cache misses or other hardware hazards, or (3) hard page faults.
You have 4 instances running - only 1 user Java thread? Also, each JVM will allocate a bunch of GC worker threads, and since the 4 JVMs know nothing about each other, the machine may get oversubscribed.
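For example (rough numbers, assuming stock HotSpot defaults): on a 16-core box each JVM should default to ParallelGCThreads = 8 + (16 - 8) * 5/8 = 13, which is what ParNew uses for its workers, so 4 instances means roughly 52 GC worker threads competing for 16 cores whenever collections overlap. Running java -XX:+PrintFlagsFinal -version | grep ParallelGCThreads against the same java binary will show the actual value.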
Are the other pauses also showing low user and sys time but high real time (relative to the other two)? Can you see if several of these JVMs are doing GCs around the same time when these pauses happen?
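If it helps, here is a rough, untested sketch (the class name and the log file names passed on the command line are placeholders of mine) that pulls the wall-clock timestamps of long stops out of each instance's GC log so you can line the four of them up:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongPauses {
    public static void main(String[] args) throws Exception {
        // Matches the -XX:+PrintGCApplicationStoppedTime lines, e.g.
        // 2013-12-02T02:16:32.839-0600: 45478.957: Total time for which application threads were stopped: 2.0675060 seconds
        Pattern p = Pattern.compile(
            "^(\\S+): (\\S+): Total time for which application threads were stopped: ([0-9.]+) seconds");
        for (String file : args) {  // e.g. instance1-gc.log ... instance4-gc.log
            for (String line : Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8)) {
                Matcher m = p.matcher(line);
                if (m.find() && Double.parseDouble(m.group(3)) > 0.2) {
                    // instance, wall-clock timestamp, length of the stop
                    System.out.println(file + "  " + m.group(1) + "  " + m.group(3) + "s");
                }
            }
        }
    }
}

Sort the combined output by the timestamp column and it should be obvious whether the long stops cluster across instances.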
The sar output also shows a fairly high context switch count.
Sent from my phone
On Dec 3, 2013 2:11 PM, "Bartek Markocki" <bartosz.markocki at gmail.com>
wrote:
> Yes, we have 4 instances on the box (bare metal, 16 cores). The workload is
> distributed fairly equally.
>
> I checked all the instances and the issue can be found on all of them;
> however, the remaining instances showed 'gaps' no longer than 200ms, and the
> number of occurrences is much lower (1 to 4).
>
> I also checked the sar statistics for the box; during the entire test they
> look very similar, i.e.:
> CPU usage ~74%(user), ~10%(system), 0.15%(iowait)
> proc/s ~ 350
> cswch/s ~70000
> no swapping
> fault/s ~9000
> majflt/s~0.02
> runq-sz ~35
> plist-sz ~ 2900
> ldavg-1 ~ 34
> ldavg-5 ~ 35
> ldavg-15 ~ 37
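> (For what it's worth, a run queue size and load average around 35 on a
> 16-core box already mean roughly twice as many runnable threads as CPUs.)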
>
> Unfortunately sar is set up to gather statistics in 10-minute buckets, so it
> is not the most accurate source of data.
>
> Bartek
>
>
> On Tue, Dec 3, 2013 at 3:26 PM, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>
>> Is anything else running on this host that consumes nontrivial CPU time?
>>
>> Sent from my phone
>> On Dec 3, 2013 7:51 AM, "Bartek Markocki" <bartosz.markocki at gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> We have been performance testing a new release of one of our applications.
>>> What caught our attention were sporadic spikes in the minor GC times.
>>>
>>> By default the application uses ParallelGC, so we were unable to distinguish
>>> between an enlarged minor GC time and something else. Therefore we switched
>>> to ParNew and added safepoint statistics printing together with safepoint
>>> cleanup time tracing.
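>>> Concretely, that means running with -XX:+UseParNewGC,
>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 and
>>> -XX:+TraceSafepointCleanupTime on top of the usual GC logging flags; the
>>> full option list is at the end of this mail.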
>>>
>>> Based on the above we got data showing that the duration of the STW pause is
>>> driven not by the duration of the minor GC but by something else.
>>>
>>> Here is the extract from the GC log file:
>>> 2013-12-02T02:16:29.993-0600: 45476.110: Total time for which application threads were stopped: 0.0304740 seconds
>>> 2013-12-02T02:16:30.772-0600: 45476.889: Application time: 0.7794150 seconds
>>> {Heap before GC invocations=53652 (full 13):
>>> par new generation total 629120K, used 565117K [0x538f0000, 0x7e390000, 0x7e390000)
>>> eden space 559232K, 100% used [0x538f0000, 0x75b10000, 0x75b10000)
>>> from space 69888K, 8% used [0x75b10000, 0x760cf500, 0x79f50000)
>>> to space 69888K, 0% used [0x79f50000, 0x79f50000, 0x7e390000)
>>> tenured generation total 1398144K, used 1077208K [0x7e390000, 0xd38f0000, 0xd38f0000)
>>> the space 1398144K, 77% used [0x7e390000, 0xbff86080, 0xbff86200, 0xd38f0000)
>>> compacting perm gen total 262144K, used 97801K [0xd38f0000, 0xe38f0000, 0xf38f0000)
>>> the space 262144K, 37% used [0xd38f0000, 0xd9872690, 0xd9872800, 0xe38f0000)
>>> No shared spaces configured.
>>> 2013-12-02T02:16:30.787-0600: 45476.904: [GC2013-12-02T02:16:32.776-0600: 45478.893: [ParNew
>>> Desired survivor size 35782656 bytes, new threshold 15 (max 15)
>>> - age 1: 1856072 bytes, 1856072 total
>>> - age 2: 170120 bytes, 2026192 total
>>> - age 3: 232696 bytes, 2258888 total
>>> - age 4: 180136 bytes, 2439024 total
>>> - age 5: 235120 bytes, 2674144 total
>>> - age 6: 242976 bytes, 2917120 total
>>> - age 7: 231728 bytes, 3148848 total
>>> - age 8: 149976 bytes, 3298824 total
>>> - age 9: 117904 bytes, 3416728 total
>>> - age 10: 126936 bytes, 3543664 total
>>> - age 11: 126624 bytes, 3670288 total
>>> - age 12: 114256 bytes, 3784544 total
>>> - age 13: 146760 bytes, 3931304 total
>>> - age 14: 163808 bytes, 4095112 total
>>> - age 15: 171664 bytes, 4266776 total
>>> : 565117K->5057K(629120K), 0.0629070 secs] 1642325K->1082393K(2027264K), 2.0523080 secs] [Times: user=0.07 sys=0.00, real=2.05 secs]
>>> Heap after GC invocations=53653 (full 13):
>>> par new generation total 629120K, used 5057K [0x538f0000, 0x7e390000, 0x7e390000)
>>> eden space 559232K, 0% used [0x538f0000, 0x538f0000, 0x75b10000)
>>> from space 69888K, 7% used [0x79f50000, 0x7a4404b8, 0x7e390000)
>>> to space 69888K, 0% used [0x75b10000, 0x75b10000, 0x79f50000)
>>> tenured generation total 1398144K, used 1077336K [0x7e390000, 0xd38f0000, 0xd38f0000)
>>> the space 1398144K, 77% used [0x7e390000, 0xbffa6080, 0xbffa6200, 0xd38f0000)
>>> compacting perm gen total 262144K, used 97801K [0xd38f0000, 0xe38f0000, 0xf38f0000)
>>> the space 262144K, 37% used [0xd38f0000, 0xd9872690, 0xd9872800, 0xe38f0000)
>>> No shared spaces configured.
>>> }
>>> 2013-12-02T02:16:32.839-0600: 45478.957: Total time for which application threads were stopped: 2.0675060 seconds
>>>
>>> Please notice the difference between the time the STW pause starts and the
>>> time the GC itself starts.
>>>
>>> And here is the output from the safepoint statistics:
>>> 45476.086: [deflating idle monitors, 0.0010070 secs]
>>> 45476.087: [updating inline caches, 0.0000000 secs]
>>> 45476.087: [compilation policy safepoint handler, 0.0005410 secs]
>>> 45476.088: [sweeping nmethods, 0.0000020 secs]
>>> vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>>> 45476.078: GenCollectForAllocation [ 294 3 6 ] [ 4 1 6 1 21 ] 3
>>> 45476.902: [deflating idle monitors, 0.0010780 secs]
>>> 45476.903: [updating inline caches, 0.0000010 secs]
>>> 45476.903: [compilation policy safepoint handler, 0.0005200 secs]
>>> 45476.904: [sweeping nmethods, 0.0000020 secs]
>>> vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>>> 45476.891: GenCollectForAllocation [ 294 2 3 ] [ 9 3 13 1 2052 ] 2
>>>
>>> So if my reading is correct, we are missing around 1.9 seconds before the GC
>>> started.
>>> Based on the above it is unclear to me where this time was spent.
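>>> If I add the pieces up (the safepoint time columns are in milliseconds):
>>> 0.013s sync + 0.001s cleanup + 1.989s between the safepoint being reached at
>>> 45476.904 and ParNew starting at 45478.893 + 0.063s of ParNew itself gives
>>> roughly 2.07s, which matches the reported 2.0675s stop. So the missing time
>>> sits inside the GenCollectForAllocation VM operation, after all threads are
>>> stopped but before the collection itself begins.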
>>>
>>> The test was running for 40 hours.
>>> Over those 40 hours we had 26 incidents where the difference was > 200ms.
>>> The above incident is the longest.
>>> Here they are:
>>> 3534.445: GenCollectForAllocation [ 307 0 2 ] [ 0 0 0 1 1935 ] 0
>>> 6403.822: GenCollectForAllocation [ 294 2 4 ] [ 0 0 0 1 1683 ] 1
>>> 9146.663: GenCollectForAllocation [ 292 2 3 ] [ 24 0 25 2 1773 ] 2
>>> 12351.684: GenCollectForAllocation [ 293 4 10 ] [ 2 0 2 1 2044 ] 2
>>> 15746.592: GenCollectForAllocation [ 294 1 2 ] [ 0 2 2 1 1697 ] 0
>>> 16574.963: GenCollectForAllocation [ 295 1 1 ] [ 0 0 0 1 1224 ] 0
>>> 18337.686: GenCollectForAllocation [ 293 1 2 ] [ 1 0 1 1 418 ] 0
>>> 19142.518: GenCollectForAllocation [ 295 0 1 ] [ 0 0 0 1 1626 ] 0
>>> 20563.826: GenCollectForAllocation [ 296 6 6 ] [ 7 2 9 1 233 ] 4
>>> 22611.752: GenCollectForAllocation [ 294 4 4 ] [ 0 0 0 1 1584 ] 1
>>> 26043.520: GenCollectForAllocation [ 295 2 6 ] [ 6 4 11 1 1883 ] 1
>>> 29584.480: GenCollectForAllocation [ 292 3 5 ] [ 0 0 0 1 1788 ] 2
>>> 33119.441: GenCollectForAllocation [ 293 2 4 ] [ 3 0 3 1 1853 ] 2
>>> 34800.660: GenCollectForAllocation [ 294 2 4 ] [ 0 0 0 1 725 ] 0
>>> 36444.246: GenCollectForAllocation [ 293 1 0 ] [ 0 0 4 1 1815 ] 0
>>> 36656.730: GenCollectForAllocation [ 294 1 3 ] [ 0 0 0 1 905 ] 0
>>> 39751.609: GenCollectForAllocation [ 294 2 4 ] [ 3 0 3 1 2207 ] 1
>>> 41836.305: GenCollectForAllocation [ 293 2 2 ] [ 1 0 1 1 286 ] 1
>>> 43323.840: GenCollectForAllocation [ 293 0 1 ] [ 0 0 0 1 2006 ] 0
>>> 45476.891: GenCollectForAllocation [ 294 2 3 ] [ 9 3 13 1 2052 ] 2
>>> 46288.453: GenCollectForAllocation [ 295 0 2 ] [ 0 4 5 1 211 ] 0
>>> 47016.430: GenCollectForAllocation [ 294 4 4 ] [ 0 0 0 1 2408 ] 0
>>> 48662.230: GenCollectForAllocation [ 293 1 4 ] [ 0 0 0 1 315 ] 1
>>> 48907.250: GenCollectForAllocation [ 296 3 6 ] [ 0 0 0 1 421 ] 0
>>> 50662.195: GenCollectForAllocation [ 294 3 4 ] [ 0 0 0 1 2043 ] 1
>>> 54128.828: GenCollectForAllocation [ 295 2 2 ] [ 1 1 3 2 2660 ] 2
>>> 57729.141: GenCollectForAllocation [ 298 1 4 ] [ 0 0 0 1 1926 ] 0
>>>
>>> We are on Java 7u40 (32-bit) on RHEL 5.5.
>>> The JVM settings are:
>>> -server -Xmx2048m -Xms2048m -XX:PermSize=256m -XX:MaxPermSize=512m
>>> -Dsun.rmi.dgc.client.gcInterval=1800000
>>> -Dsun.rmi.dgc.server.gcInterval=1800000 -XX:+DisableExplicitGC
>>> -XX:+UseParNewGC -XX:MaxGCPauseMillis=3000
>>> -verbose:gc -Xloggc:... -XX:+PrintGCDetails -XX:+PrintHeapAtGC
>>> -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
>>> -XX:+TraceSafepointCleanupTime -XX:+PrintSafepointStatistics
>>> -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCApplicationConcurrentTime
>>> -XX:+PrintGCApplicationStoppedTime
>>> -Djava.net.preferIPv4Stack=true
>>>
>>> The question of the day is: what else can we do to diagnose the above?
>>>
>>> Thanks in advance,
>>> Bartek
>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>>>
>