Long "stop-the-world" pauses in CMS GC mode

David Tavoularis david.tavoularis at mycom-int.com
Tue Aug 30 02:52:18 PDT 2011


Hi Ramki, Zdeněk,

Thank you for your valuable answers.

> So the pause is associated not with GC work itself (which is correctly reported as 202 ms), but rather with a
> preamble to the GC, perhaps with bringing threads to a safepoint, I am guessing.
I will ask to add -XX:+PrintSafepointStatistics. What are the expected outputs ? Will it be in GC logs or in stdout ?

> you have a 48 core machine but you have handicapped yr JVM by forcing -XX:-CMSParallelRemarkEnabled,
> which forces single--threaded remarks on such a huge heap.
I will ask to remove it and let you know.

> If "ref proc" turns out to be still a problem, you can then enable +ParallelRefProcEnabled to parallelize that sub-phase as well.
I will not activate -XX:+ParallelRefProcEnabled, because according to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7028845, it is broken in Java6u25 and fixed in Java6u27.

> I have heard apocryphal stories of -XX:+UseMembar having worked to get rid of overly long safepointing pauses,
> and I have heard -XX:-UseBiasedLocking for pauses associated with bulk bias revocations.
Good to know, but I won't use them until I get more info from  -XX:+PrintSafepointStatistics and a new analysis after removing  -XX:-CMSParallelRemarkEnabled

>> The only suspicious thing is "[Times: user=5.67 sys=0.02, real=59.52 secs]", which means that the "real" duration is a lot higher than "user" CPU time.
>> Because "sys" duration is low, it also means that the server is not swapping.
> this can happen when the machine is overloaded. And as for swapping, I think it is not involved in the sys time because these times are times of the application thread.
In my experience, when server is swapping, the "sys" time duration is increasing a lot.
I can confirm that there is no high CPU load on the server (max CPU usage is 30% in the last 7 days) and no disk swapping (according to vmstat "sr"="scan rate" metrics).
According to Ramki, I need to understand the reason of slow safepoint action.

Best Regards
--
David Tavoularis
Head of L3 Support, Capacity & Dimensioning
Mycom Group

On Tue, 30 Aug 2011 09:19:52 +0200, Ramki Ramakrishna <y.s.ramakrishna at oracle.com> wrote:

David, I missed one of the longer pauses that you'd specifically drawn attention to:-

On 8/29/2011 2:25 AM, David Tavoularis wrote:
1. Here is the first pattern : a 61-second pause, but I don't see any suspicious message in GC logs:
...
2011-08-24T10:25:07.776+0100: 44242.537: [GC 44301.853: [ParNew
Desired survivor size 53673984 bytes, new threshold 1 (max 4)
- age 1: 99505080 bytes, 99505080 total
: 943744K->104832K(943744K), 0.2010508 secs] 21542906K->20852742K(50226816K), 0.2022636 secs] [Times: user=5.67 sys=0.02, real=59.52 secs]

If you look at the timestamps above, the GC event starts off at 44242.537 seconds, but then the GC itself does not
commence until 44301.853 seconds, i.e. a full 59.32 seconds later. So the pause is associated not with
GC work itself (which is correctly reported as 202 ms), but rather with a preamble to the GC, perhaps
with bringing threads to a safepoint, I am guessing. Once again -XX:+PrintSafepointStatistics (which
i mentioned in previous email wrt the 20 s pause in the middle of noweher) would likely provide
some clues. I have heard apocryphal stories of -XX:+UseMembar having worked to get rid of
overly long safepointing pauses,. and I have heard -XX:-UseBiasedLocking for pauses associated
with bulk bias revocations. But, without +PrintSafepointStatistics data to draw inferences
from, those incantations would just constitute superstitious mumbo-jumbo.

-- ramki

Heap after GC invocations=1188 (full 12):
par new generation total 943744K, used 104832K [0xfffffff353c00000, 0xfffffff393c00000, 0xfffffff393c00000)
eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 0xfffffff386f40000)
from space 104832K, 100% used [0xfffffff38d5a0000, 0xfffffff393c00000, 0xfffffff393c00000)
to space 104832K, 0% used [0xfffffff386f40000, 0xfffffff386f40000, 0xfffffff38d5a0000)
concurrent mark-sweep generation total 49283072K, used 20747910K [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
concurrent-mark-sweep perm gen total 524288K, used 42905K [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
}
Total time for which application threads were stopped: 61.5576519 seconds
Application time: 0.0245838 seconds
Total time for which application threads were stopped: 9.8331189 seconds
Application time: 0.0012626 seconds
Total time for which application threads were stopped: 0.0090404 seconds
Application time: 0.0008943 seconds
Total time for which application threads were stopped: 0.0020415 seconds
Application time: 0.0008181 seconds
Total time for which application threads were stopped: 0.2338605 seconds
Application time: 0.0018822 seconds

The only suspicious thing is "[Times: user=5.67 sys=0.02, real=59.52 secs]", which means that the "real" duration is a lot higher than "user" CPU time.
Because "sys" duration is low, it also means that the server is not swapping.
What could explain this 61 seconds pause ?




________________________________

This electronic message contains information from Mycom which may be privileged or confidential. The information is intended to be for the use of the individual(s) or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution or any other use of the contents of this information is prohibited. If you have received this electronic message in error, please notify us by post or telephone (to the numbers or correspondence address above) or by email (at the email address above) immediately.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20110830/3b7bd2ed/attachment.html 


More information about the hotspot-gc-use mailing list