Spikes in duration of the minor GC caused by some unknown JVM activity before GC

Bartek Markocki bartosz.markocki at gmail.com
Fri Dec 6 06:59:44 PST 2013


Hi Vitaly,

Thanks for the hint about real time vs. user+sys. I rechecked the test
setup and environment. The more I dug, the more issues I found, and I
concluded that there were too many moving parts for anyone to draw a
meaningful conclusion from the test. The test will therefore be redone
in a more 'predictable' environment.

Despite the above, I found one more peculiar behavior that I do not
fully understand.
Another box, running a different application, was also involved in the test:
RHEL 5.5 and Java 7u10 64-bit, two instances on the box, both running CMS.
Both instances hit a promotion failure due to a humongous allocation (a ~1M
array), though not at the same time. When I looked closely at the performance
of the resulting Full GCs, I found this fellow:

53786.315: [GC 53786.315: [ParNew (0: promotion failure size = 3)  (1:
promotion failure size = 5)  (2: promotion failure size = 4)  (3: promotion
failure size = 5)  (4: promotion failure size = 3)  (5: promotion failure
size = 3)  (6: promotion failure size = 5)  (7: promotion failure size =
5)  (8: promotion failure size = 5)  (9: promotion failure size = 3)  (10:
promotion failure size = 5)  (11: promotion failure size = 4)  (12:
promotion failure size = 3)  (promotion failed):
1887398K->1811380K(1887488K), 67.8065550 secs]53854.122: [CMS:
1042845K->786493K(1048576K), 1.9985940 secs] 2835614K->786493K(2936064K),
[CMS Perm : 44209K->44206K(73852K)], 69.8061140 secs] [Times: user=70.20
sys=15.68, real=69.80 secs]


Other promotion failures looked similar to this one:
47287.350: [GC 47287.351: [ParNew (0: promotion failure size = 3)  (1:
promotion failure size = 4)  (2: promotion failure size = 4)  (3: promotion
failure size = 5)  (4: promotion failure size = 5)  (5: promotion failure
size = 4)  (6: promotion failure size = 5)  (7: promotion failure size =
5)  (8: promotion failure size = 3)  (9: promotion failure size = 3)  (10:
promotion failure size = 3)  (11: promotion failure size = 5)  (12:
promotion failure size = 4)  (promotion failed):
1887488K->1836533K(1887488K), 0.8193500 secs]47288.170: [CMS:
1033291K->779944K(1048576K), 2.1294690 secs] 2778468K->779944K(2936064K),
[CMS Perm : 44209K->44209K(73852K)], 2.9497890 secs] [Times: user=4.35
sys=0.08, real=2.95 secs]
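
Putting the two [Times:] lines side by side (user and sys are CPU seconds
summed over all threads; real is wall-clock time):

  slow event: user + sys = 70.20 + 15.68 = 85.88s of CPU over real = 69.80s
  fast event: user + sys =  4.35 +  0.08 =  4.43s of CPU over real =  2.95s

The slow collection burned roughly 20 times the CPU of the typical one, and
15.68s of that was spent in the kernel versus 0.08s normally.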

For the time of the incident I checked:
1. CPU utilization - no increase (stable at ~40%)
2. swap - no swapping
3. iostat, sar - no significant change compared to the periods before and
after the incident
4. performance of the other instance on the box, in terms of the number of
transactions and their duration (including max) - no sign of any slowdown

And even if I ignore the real time value, I cannot find an explanation
for the sys time value.

Does this ring any bells for you? Is there anything else I can check to
understand it?
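
One thing I am considering for the rerun is a timestamped, in-process log of
every collection, so the pauses can be lined up with per-second OS sampling
rather than 10-minute sar buckets. A minimal sketch using the
com.sun.management GC notification API in Java 7 (the GcPauseLog class name is
just for illustration):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

public class GcPauseLog {
    public static void install() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) gc).addNotificationListener(new NotificationListener() {
                public void handleNotification(Notification n, Object handback) {
                    if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                            .equals(n.getType())) {
                        return;
                    }
                    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                            .from((CompositeData) n.getUserData());
                    // wall-clock timestamp, collector name, cause and duration (ms)
                    System.out.println(System.currentTimeMillis() + " " + info.getGcName()
                            + " (" + info.getGcCause() + ") "
                            + info.getGcInfo().getDuration() + "ms");
                }
            }, null, null);
        }
    }
}

Calling GcPauseLog.install() once at startup should be enough to get one line
per collection on stdout.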

Thanks a lot!
Bartek


On Tue, Dec 3, 2013 at 9:06 PM, Vitaly Davidovich <vitalyd at gmail.com> wrote:

> Well, the reason I asked is because of this line:
>
> 565117K->5057K(629120K), 0.0629070 secs] 1642325K->1082393K(2027264K),
> 2.0523080 secs] [Times: user=0.07 sys=0.00, real=2.05 secs]
>
> Typically, when you have user time completely out of whack with real time
> (and low sys time), you have (1) GC threads not getting enough CPU time because
> other things are runnable, (2) severe stalls incurred by GC threads due to
> something like cache misses or other hardware hazards, or (3) hard page faults.
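>
> To make that concrete for the line above:
>
>   user + sys = 0.07s + 0.00s = 0.07s of CPU (summed over all threads)
>   real       = 2.05s of wall-clock time
>
> so only about 3% of one CPU was doing this JVM's GC work during that pause.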
>
> You have 4 instances running - only 1 user Java thread? Also, the JVM will
> allocate a bunch of GC worker threads, and since you have 4 JVMs running,
> each not knowing about the others, the machine may get oversubscribed.
>
> Are the other pauses also showing low user and sys time but high real time
> (relative to the other two)? Can you see if several of these JVMs are doing GCs
> around the same time when these pauses happen?
>
> The sar output shows a rather high context switch count.
>
> Sent from my phone
> On Dec 3, 2013 2:11 PM, "Bartek Markocki" <bartosz.markocki at gmail.com>
> wrote:
>
>> Yes, we have 4 instances on the box (bare metal, 16 cores). The workload
>> is distributed fairly evenly.
>>
>> I checked all the instances and the issue can be found on all of them;
>> however, the other instances showed 'gaps' no longer than 200ms, and
>> the number of occurrences is much lower (1 to 4).
>>
>> I also checked the sar statistics for the box; during the entire test
>> they look very similar, i.e.:
>> CPU usage ~74%(user), ~10%(system), 0.15%(iowait)
>> proc/s ~ 350
>> cswch/s ~70000
>> no swapping
>> fault/s ~9000
>> majflt/s~0.02
>> runq-sz ~35
>> plist-sz  ~ 2900
>> ldavg-1  ~ 34
>> ldavg-5  ~ 35
>> ldavg-15 ~ 37
>>
>> Unfortunately sar is set up to gather statistics in 10-minute buckets,
>> so it is not the most accurate source of data.
>>
>> Bartek
>>
>>
>> On Tue, Dec 3, 2013 at 3:26 PM, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>>
>>> Is anything else running on this host that consumes nontrivial cpu time?
>>>
>>> Sent from my phone
>>> On Dec 3, 2013 7:51 AM, "Bartek Markocki" <bartosz.markocki at gmail.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> We have been running performance tests of a new release of one of
>>>> our applications. What caught our attention was sporadic spikes in minor
>>>> GC times.
>>>>
>>>> By default the application uses ParallelGC, so we were unable to
>>>> distinguish between genuinely longer minor GC times and something else.
>>>> Therefore we switched to ParNew and enabled printing of safepoint
>>>> statistics together with tracing of safepoint cleanup time.
>>>>
>>>> Based on the above we got data showing that the duration of the STW pause
>>>> is driven not by the duration of the minor GC but by something else.
>>>>
>>>> Here is the extract from the GC log file:
>>>> 2013-12-02T02:16:29.993-0600: 45476.110: Total time for which
>>>> application threads were stopped: 0.0304740 seconds
>>>> 2013-12-02T02:16:30.772-0600: 45476.889: Application time: 0.7794150
>>>> seconds
>>>> {Heap before GC invocations=53652 (full 13):
>>>>  par new generation   total 629120K, used 565117K [0x538f0000,
>>>> 0x7e390000, 0x7e390000)
>>>>   eden space 559232K, 100% used [0x538f0000, 0x75b10000, 0x75b10000)
>>>>   from space 69888K,   8% used [0x75b10000, 0x760cf500, 0x79f50000)
>>>>   to   space 69888K,   0% used [0x79f50000, 0x79f50000, 0x7e390000)
>>>>  tenured generation   total 1398144K, used 1077208K [0x7e390000,
>>>> 0xd38f0000, 0xd38f0000)
>>>>    the space 1398144K,  77% used [0x7e390000, 0xbff86080, 0xbff86200,
>>>> 0xd38f0000)
>>>>  compacting perm gen  total 262144K, used 97801K [0xd38f0000,
>>>> 0xe38f0000, 0xf38f0000)
>>>>    the space 262144K,  37% used [0xd38f0000, 0xd9872690, 0xd9872800,
>>>> 0xe38f0000)
>>>> No shared spaces configured.
>>>> 2013-12-02T02:16:30.787-0600: 45476.904:
>>>> [GC2013-12-02T02:16:32.776-0600: 45478.893: [ParNew
>>>> Desired survivor size 35782656 bytes, new threshold 15 (max 15)
>>>> - age   1:    1856072 bytes,    1856072 total
>>>> - age   2:     170120 bytes,    2026192 total
>>>> - age   3:     232696 bytes,    2258888 total
>>>> - age   4:     180136 bytes,    2439024 total
>>>> - age   5:     235120 bytes,    2674144 total
>>>> - age   6:     242976 bytes,    2917120 total
>>>> - age   7:     231728 bytes,    3148848 total
>>>> - age   8:     149976 bytes,    3298824 total
>>>> - age   9:     117904 bytes,    3416728 total
>>>> - age  10:     126936 bytes,    3543664 total
>>>> - age  11:     126624 bytes,    3670288 total
>>>> - age  12:     114256 bytes,    3784544 total
>>>> - age  13:     146760 bytes,    3931304 total
>>>> - age  14:     163808 bytes,    4095112 total
>>>> - age  15:     171664 bytes,    4266776 total
>>>> : 565117K->5057K(629120K), 0.0629070 secs]
>>>> 1642325K->1082393K(2027264K), 2.0523080 secs] [Times: user=0.07 sys=0.00,
>>>> real=2.05 secs]
>>>> Heap after GC invocations=53653 (full 13):
>>>>  par new generation   total 629120K, used 5057K [0x538f0000,
>>>> 0x7e390000, 0x7e390000)
>>>>   eden space 559232K,   0% used [0x538f0000, 0x538f0000, 0x75b10000)
>>>>   from space 69888K,   7% used [0x79f50000, 0x7a4404b8, 0x7e390000)
>>>>   to   space 69888K,   0% used [0x75b10000, 0x75b10000, 0x79f50000)
>>>>  tenured generation   total 1398144K, used 1077336K [0x7e390000,
>>>> 0xd38f0000, 0xd38f0000)
>>>>    the space 1398144K,  77% used [0x7e390000, 0xbffa6080, 0xbffa6200,
>>>> 0xd38f0000)
>>>>  compacting perm gen  total 262144K, used 97801K [0xd38f0000,
>>>> 0xe38f0000, 0xf38f0000)
>>>>    the space 262144K,  37% used [0xd38f0000, 0xd9872690, 0xd9872800,
>>>> 0xe38f0000)
>>>> No shared spaces configured.
>>>> }
>>>> 2013-12-02T02:16:32.839-0600: 45478.957: Total time for which
>>>> application threads were stopped: 2.0675060 seconds
>>>>
>>>> Please notice the difference between the start times of the STW pause and of the GC itself.
>>>>
>>>> And here is the output from the safepoint statistics:
>>>> 45476.086: [deflating idle monitors, 0.0010070 secs]
>>>> 45476.087: [updating inline caches, 0.0000000 secs]
>>>> 45476.087: [compilation policy safepoint handler, 0.0005410 secs]
>>>> 45476.088: [sweeping nmethods, 0.0000020 secs]
>>>>          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
>>>> 45476.078: GenCollectForAllocation          [     294          3              6    ]      [     4     1     6     1    21    ]  3
>>>> 45476.902: [deflating idle monitors, 0.0010780 secs]
>>>> 45476.903: [updating inline caches, 0.0000010 secs]
>>>> 45476.903: [compilation policy safepoint handler, 0.0005200 secs]
>>>> 45476.904: [sweeping nmethods, 0.0000020 secs]
>>>>          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
>>>> 45476.891: GenCollectForAllocation          [     294          2              3    ]      [     9     3    13     1  2052    ]  2
>>>>
>>>> So if my reading is correct, we are missing around 1.9 seconds before the
>>>> GC started.
>>>> Based on the above it is unclear to me where this time was spent.
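>>>>
>>>> Spelling out the numbers from the log and the safepoint record above:
>>>>
>>>>   45478.893 - 45476.904 = 1.989s  gap before the ParNew work starts
>>>>   ParNew itself         = 0.063s
>>>>   1.989 + 0.063         ~ 2.052s  = the 2052ms vmop time
>>>>   + sync (13ms) + cleanup (1ms)   ~ the 2.0675s total stopped time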
>>>>
>>>> The test ran for 40 hours.
>>>> Over those 40 hours we had 26 incidents where the difference was > 200ms;
>>>> the above incident is the longest. Here they are:
>>>>          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
>>>>  3534.445: GenCollectForAllocation          [     307          0              2    ]      [     0     0     0     1  1935    ]  0
>>>>  6403.822: GenCollectForAllocation          [     294          2              4    ]      [     0     0     0     1  1683    ]  1
>>>>  9146.663: GenCollectForAllocation          [     292          2              3    ]      [    24     0    25     2  1773    ]  2
>>>> 12351.684: GenCollectForAllocation          [     293          4             10    ]      [     2     0     2     1  2044    ]  2
>>>> 15746.592: GenCollectForAllocation          [     294          1              2    ]      [     0     2     2     1  1697    ]  0
>>>> 16574.963: GenCollectForAllocation          [     295          1              1    ]      [     0     0     0     1  1224    ]  0
>>>> 18337.686: GenCollectForAllocation          [     293          1              2    ]      [     1     0     1     1   418    ]  0
>>>> 19142.518: GenCollectForAllocation          [     295          0              1    ]      [     0     0     0     1  1626    ]  0
>>>> 20563.826: GenCollectForAllocation          [     296          6              6    ]      [     7     2     9     1   233    ]  4
>>>> 22611.752: GenCollectForAllocation          [     294          4              4    ]      [     0     0     0     1  1584    ]  1
>>>> 26043.520: GenCollectForAllocation          [     295          2              6    ]      [     6     4    11     1  1883    ]  1
>>>> 29584.480: GenCollectForAllocation          [     292          3              5    ]      [     0     0     0     1  1788    ]  2
>>>> 33119.441: GenCollectForAllocation          [     293          2              4    ]      [     3     0     3     1  1853    ]  2
>>>> 34800.660: GenCollectForAllocation          [     294          2              4    ]      [     0     0     0     1   725    ]  0
>>>> 36444.246: GenCollectForAllocation          [     293          1              0    ]      [     0     0     4     1  1815    ]  0
>>>> 36656.730: GenCollectForAllocation          [     294          1              3    ]      [     0     0     0     1   905    ]  0
>>>> 39751.609: GenCollectForAllocation          [     294          2              4    ]      [     3     0     3     1  2207    ]  1
>>>> 41836.305: GenCollectForAllocation          [     293          2              2    ]      [     1     0     1     1   286    ]  1
>>>> 43323.840: GenCollectForAllocation          [     293          0              1    ]      [     0     0     0     1  2006    ]  0
>>>> 45476.891: GenCollectForAllocation          [     294          2              3    ]      [     9     3    13     1  2052    ]  2
>>>> 46288.453: GenCollectForAllocation          [     295          0              2    ]      [     0     4     5     1   211    ]  0
>>>> 47016.430: GenCollectForAllocation          [     294          4              4    ]      [     0     0     0     1  2408    ]  0
>>>> 48662.230: GenCollectForAllocation          [     293          1              4    ]      [     0     0     0     1   315    ]  1
>>>> 48907.250: GenCollectForAllocation          [     296          3              6    ]      [     0     0     0     1   421    ]  0
>>>> 50662.195: GenCollectForAllocation          [     294          3              4    ]      [     0     0     0     1  2043    ]  1
>>>> 54128.828: GenCollectForAllocation          [     295          2              2    ]      [     1     1     3     2  2660    ]  2
>>>> 57729.141: GenCollectForAllocation          [     298          1              4    ]      [     0     0     0     1  1926    ]  0
>>>>
>>>> We are on Java 7u40, 32-bit, on RHEL 5.5.
>>>> The JVM settings are:
>>>> -server -Xmx2048m -Xms2048m -XX:PermSize=256m -XX:MaxPermSize=512m
>>>> -Dsun.rmi.dgc.client.gcInterval=1800000
>>>> -Dsun.rmi.dgc.server.gcInterval=1800000 -XX:+DisableExplicitGC
>>>> -XX:+UseParNewGC -XX:MaxGCPauseMillis=3000
>>>> -verbose:gc -Xloggc:... -XX:+PrintGCDetails -XX:+PrintHeapAtGC
>>>> -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
>>>> -XX:+TraceSafepointCleanupTime -XX:+PrintSafepointStatistics
>>>> -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCApplicationConcurrentTime
>>>> -XX:+PrintGCApplicationStoppedTime
>>>> -Djava.net.preferIPv4Stack=true
>>>>
>>>> The question of the day is: what else can we do to diagnose the above?
>>>>
>>>> Thanks in advance,
>>>> Bartek
>>>>