From y.s.ramakrishna at oracle.com  Tue Jul 12 10:28:00 2011
From: y.s.ramakrishna at oracle.com (Y. Srinivas Ramakrishna)
Date: Tue, 12 Jul 2011 10:28:00 -0700
Subject: Young GC pause time definitions
In-Reply-To:
References:
Message-ID: <4E1C8420.1080305@oracle.com>

Hi Christian --

I've bcc'd gc-dev and cross-posted over to gc-use, where this thread
more naturally seems to belong ...

I don't think it's the safepointing that's the problem here. It's
probably a surviving data structure whose discovery/marking does not
parallelize well. There were issues such as that with LBQ before, but
as I recall they were fixed a while ago.

As you noted, I'd support looking at survivor space overflow, and
seeing if larger survivor spaces, and perhaps a look at the tenuring
threshold, shed more light.

Also, two other things to try, to learn more about the issue (besides
PrintTenuringDistribution):

(1) +UseParNewGC -UseConcMarkSweepGC (i.e. turn off CMS to see how the
    scavenge times are affected, with both old and new versions of yr
    application)
(2) check the ratio of elapsed time to (user + system) time during GC
    to see if GC parallelization is worse after the upgrade of yr
    application

More inline below:-

On 7/12/2011 5:42 AM, Christian wrote:
> Hi gc devs,
>
> I'm investigating young gc times that increased 10-fold after our
> software was upgraded on a customer system. I've been unable to
> reproduce something similar with a local test machine, so I am unable
> to experiment. For the customer, it used to be ~20ms young GCs and now
> it is ~200ms long stops.
>
> I get the stop time by looking at the increment in the YGCT column of
> "jstat -gc" output.
>
> After the upgrade, there are more objects that survive into the CMS
> space. It used to be 300 bytes (first I see something that low) and
> now it is up to 600 KB on each young gc. This would explain more work
> done during young gc. But can CPU-bound work really explain an extra
> 180ms to get 600 KB moved into CMS space? Especially as I don't see a
> big difference in CPU usage before vs. after.

Look at the ratio of (system + user) to elapsed time. Perhaps it's
because of some data structure (long, skinny) whose discovery/marking
is not parallelizing well. Usually this does cause more system time
because of a problem with task stealing.

>
> I'm on thin ice with this: failure to allocate an object causes a vm
> safepoint, at which the vm runs a young gc, is that right?

Yes.

> Could it happen that it takes time to get all mutator threads to stop
> in this safepoint before young gc can proceed?

Unlikely (based on reading yr description), but ...

>
> If this is so, is there any way to learn how much time is spent
> getting threads into the safepoint, vs the time to do gc work? And
> what is measured by the young gc time from jstat or verbose gc?

+PrintSafepointStatistics, which you listed below (and which is more
informative in more recent JVM versions).

>
> The biggest change in our software is that we have reimplemented a big
> part of our main execution to rely on blocking queues (from java util
> concurrent) instead of an in-house queue implementation based on java
> monitors. I've had some wild guesses that synchronization on blocking

Note the problem with LinkedBlockingQueues (I'll dig up the CR id) that
caused GC behaviour similar to what you describe. Fixed in a later JDK
(I'll dig up the details for you).

> queues are less aware of having to get into a safepoint, because they
> are a less-overhead synchronization mechanism.
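To expand on the LBQ point above: the shape I have in mind looks
roughly like the sketch below. This is a hand-written illustration
only -- the names (LeakyNodeQueue, Node, enqueue, dequeue) are made up
and it is not the actual java.util.concurrent code -- but it shows how
a linked-node queue that never clears the 'next' field of removed
nodes builds up a long, skinny chain of dead nodes, each pointing at
the next, younger one. If one of those nodes has been tenured before
it is dequeued, its stale 'next' reference acts as an old-to-young
root at the next scavenge, keeping the node behind it alive (and
eventually tenured too), and so on down the chain. The scavenger then
discovers and copies that chain from essentially a single root, which
is exactly the kind of work that does not parallelize:

  // Hypothetical sketch, for illustration only -- not j.u.c. code.
  final class LeakyNodeQueue<E> {
      static final class Node<T> {
          T item;
          Node<T> next;
          Node(T item) { this.item = item; }
      }

      private Node<E> head = new Node<E>(null); // dummy sentinel
      private Node<E> tail = head;

      void enqueue(E e) {
          Node<E> n = new Node<E>(e);
          tail.next = n;
          tail = n;
      }

      E dequeue() {
          Node<E> first = head.next;
          if (first == null) return null;   // queue empty
          E item = first.item;
          first.item = null;
          head = first;
          // The old head is garbage now, but its 'next' field still
          // points at the next, younger node. If the old head was
          // tenured, a young collection never reclaims it, and that
          // stale reference keeps the younger "dead" nodes reachable.
          return item;
      }
  }

The usual remedy is for dequeue() to null out or self-link the removed
node's 'next' field so the chain is cut as soon as a node leaves the
queue -- which, as I recall, is essentially what the later fix did.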
>
> By scanning through this mailing list and other sites I have learned
> about PrintSafepointStatistics, PrintGCApplicationStoppedTime &
> PrintGCApplicationConcurrentTime. But I haven't found any detailed
> explanation of what these values are. Potentially I could get the
> customer to re-run tests with these, but as I don't know exactly what
> the output means, I can't say for sure how I would use the information
> gained from it.

There's a header line that describes the values (the printing was less
than friendly in earlier versions; that was fixed in a more recent
version). For an explanation of any fields that don't make sense, check
on the hotspot-runtime-dev list.
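For example, re-running with something along these lines added to the
java command would give you the per-GC user/sys/real breakdown as well
as the safepoint and stopped-time numbers (the -Xmx/-Xmn values below
are only placeholders; keep whatever the customer really runs with):

  java -Xmx12g -Xmn512m \
       -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -XX:+PrintTenuringDistribution \
       -XX:+PrintGCApplicationStoppedTime \
       -XX:+PrintGCApplicationConcurrentTime \
       -XX:+PrintSafepointStatistics \
       ... <the usual application flags and main class>

The "[Times: user=... sys=... real=...]" tail of each ParNew record in
the PrintGCDetails output is what you'd use for the
elapsed-vs-(user+system) ratio I mentioned above: on a multi-core box,
user+sys should be several times real if the scavenge is parallelizing
well.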
>
> The customer site is running an old jdk 1.6.0_14, with
> -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC. It uses a 12 GB heap and
> a relatively small 512 MB new size.
>
> If I'm completely wrong in the above guesses about how things work, I
> will suggest to them to increase the NewSize, as it seems our new
> version of the software fills up newspace too quickly and short-lived
> objects are promoted into the CMS heap too soon. But if the problem
> would be

That could be part of the reason, but my guess is that if the LBQ issue
is the problem, this will not help reduce scavenge times; on the
contrary, it may increase scavenge times because longer-surviving
objects are copied more times between survivor spaces before being
tenured.

> that mutator threads don't enter the safepoint as quickly as with our
> old software, then their long young gc pauses would remain. I really
> want them to upgrade to a more recent jdk as I know 1.6.0_18 fixed
> some CMS fragmentation bugs.

-- ramki

From y.s.ramakrishna at oracle.com  Tue Jul 12 21:56:37 2011
From: y.s.ramakrishna at oracle.com (Y. Srinivas Ramakrishna)
Date: Tue, 12 Jul 2011 21:56:37 -0700
Subject: Young GC pause time definitions
In-Reply-To:
References: <4E1C8420.1080305@oracle.com>
Message-ID: <4E1D2585.9060906@oracle.com>

On 7/12/2011 9:45 PM, Krystal Mok wrote:
> On Wed, Jul 13, 2011 at 1:28 AM, Y. Srinivas Ramakrishna <
> y.s.ramakrishna at oracle.com> wrote:
>>
>>>
>>> The biggest change in our software is that we have reimplemented a big
>>> part of our main execution to rely on blocking queues (from java util
>>> concurrent) instead of an in-house queue implementation based on java
>>> monitors. I've had some wild guesses that synchronization on blocking
>>
>> Note the problem with LinkedBlockingQueues (I'll dig up the CR id) that
>> caused GC behaviour similar to what you describe. Fixed in a later JDK
>> (I'll dig up the details for you).
>
>
> Is it this one: http://bugs.sun.com/view_bug.do?bug_id=6805775
> Bug ID 6805775: LinkedBlockingQueue Nodes should unlink themselves before
> becoming garbage

Yes, thanks Kris, that's the one I had in mind. Given that it's
indicated as fixed in 6u19 and Christian is on 6u14, it's worth trying
a JDK >= 6u19 and preferably the latest (6u26 or 6u27?).

-- ramki

From rednaxelafx at gmail.com  Tue Jul 12 21:45:35 2011
From: rednaxelafx at gmail.com (Krystal Mok)
Date: Wed, 13 Jul 2011 12:45:35 +0800
Subject: Young GC pause time definitions
In-Reply-To: <4E1C8420.1080305@oracle.com>
References: <4E1C8420.1080305@oracle.com>
Message-ID:

On Wed, Jul 13, 2011 at 1:28 AM, Y. Srinivas Ramakrishna <
y.s.ramakrishna at oracle.com> wrote:

>
>
> > The biggest change in our software is that we have reimplemented a big
> > part of our main execution to rely on blocking queues (from java util
> > concurrent) instead of an in-house queue implementation based on java
> > monitors. I've had some wild guesses that synchronization on blocking
>
> Note the problem with LinkedBlockingQueues (I'll dig up the CR id) that
> caused GC behaviour similar to what you describe. Fixed in a later JDK
> (I'll dig up the details for you).

Is it this one: http://bugs.sun.com/view_bug.do?bug_id=6805775
Bug ID 6805775: LinkedBlockingQueue Nodes should unlink themselves before
becoming garbage

Regards,
Kris Mok
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20110713/28a425e2/attachment.html