From y.s.ramakrishna at oracle.com  Tue Jul 12 10:28:00 2011
From: y.s.ramakrishna at oracle.com (Y. Srinivas Ramakrishna)
Date: Tue, 12 Jul 2011 10:28:00 -0700
Subject: Young GC pause time definitions
In-Reply-To:
References:
Message-ID: <4E1C8420.1080305@oracle.com>

Hi Christian --

I've bcc'd gc-dev and cross-posted over to gc-use, where this thread
more naturally seems to belong ...

I don't think it's the safepointing that's the problem here. It's
probably a surviving data structure whose discovery/marking does not
parallelize well. There were issues such as that with LBQ before, but
as I recall they were fixed a while ago.

As you noted, I'd support looking at survivor space overflow, and
seeing if larger survivor spaces, and perhaps a look at the tenuring
threshold, shed more light.

Also, two other things to try, to learn more about the issue (besides
PrintTenuringDistribution):

(1) +UseParNewGC -UseConcMarkSweepGC (i.e. turn off CMS to see how the
    scavenge times are affected, with both old and new versions of yr
    application)
(2) check the ratio of elapsed time to (user + system) time during GC
    to see if GC parallelization is worse after the upgrade of yr
    application

More inline below:-

On 7/12/2011 5:42 AM, Christian wrote:
> Hi gc devs,
>
> I'm investigating young gc times that increased 10-fold after our
> software was upgraded on a customer system. I've been unable to
> reproduce something similar with a local test machine, so I am unable
> to experiment. For the customer, it used to be ~20ms young GCs and now
> it is ~200ms long stops.
>
> I get the stop time by looking at the increment in the YGCT column of
> "jstat -gc" output.
>
> After the upgrade, there are more objects that survive into the CMS
> space. It used to be 300 bytes (first I see something that low) and
> now it is up to 600 KB on each young gc. This would explain more work
> done during young gc. But can CPU-bound work really explain an extra
> 180ms to get 600 KB moved into CMS space? Especially as I don't see a
> big difference in CPU usage before vs. after.

Look at the ratio of (system + user) to elapsed time. Perhaps it's
because of some data structure (long, skinny) whose discovery/marking
is not parallelizing well. Usually this does cause more system time
because of a problem with task stealing.

>
> I'm on thin ice with this: failure to allocate an object causes a vm
> safepoint, at which the vm runs a young gc, is that right?

Yes.

> Could it happen that it takes time to get all mutator threads to stop
> in this safepoint before young gc can proceed?

Unlikely (based on reading yr description), but ...

>
> If this is so, is there any way to learn how much time is spent
> getting threads into the safepoint, vs the time to do gc work? And
> what is measured by the young gc time from jstat or verbose gc?

+PrintSafepointStatistics, which you listed below (and which is more
informative in more recent JVM versions).

>
> The biggest change in our software is that we have reimplemented a big
> part of our main execution to rely on blocking queues (from java util
> concurrent) instead of an in-house queue implementation based on java
> monitors. I've had some wild guesses that synchronization on blocking

Note the problem with LinkedBlockingQueues (I'll dig up the CR id) that
caused GC behaviour similar to what you describe. Fixed in a later JDK
(I'll dig up the details for you).

> queues are less aware of having to get into a safepoint, because they
> are a less-overhead synchronization mechanism.
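To expand on the LBQ point above: the shape I have in mind looks
roughly like the sketch below. This is a hand-written illustration
only -- the names (LeakyNodeQueue, Node, enqueue, dequeue) are made up
and it is not the actual java.util.concurrent code -- but it shows how
a linked-node queue that never clears the 'next' field of removed
nodes builds up a long, skinny chain of dead nodes, each pointing at
the next, younger one. If one of those nodes has been tenured before
it is dequeued, its stale 'next' reference acts as an old-to-young
root at the next scavenge, keeping the node behind it alive (and
eventually tenured too), and so on down the chain. The scavenger then
discovers and copies that chain from essentially a single root, which
is exactly the kind of work that does not parallelize:

  // Hypothetical sketch, for illustration only -- not j.u.c. code.
  final class LeakyNodeQueue<E> {
      static final class Node<T> {
          T item;
          Node<T> next;
          Node(T item) { this.item = item; }
      }

      private Node<E> head = new Node<E>(null); // dummy sentinel
      private Node<E> tail = head;

      void enqueue(E e) {
          Node<E> n = new Node<E>(e);
          tail.next = n;
          tail = n;
      }

      E dequeue() {
          Node<E> first = head.next;
          if (first == null) return null;   // queue empty
          E item = first.item;
          first.item = null;
          head = first;
          // The old head is garbage now, but its 'next' field still
          // points at the next, younger node. If the old head was
          // tenured, a young collection never reclaims it, and that
          // stale reference keeps the younger "dead" nodes reachable.
          return item;
      }
  }

The usual remedy is for dequeue() to null out or self-link the removed
node's 'next' field so the chain is cut as soon as a node leaves the
queue -- which, as I recall, is essentially what the later fix did.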
>
> By scanning through this mailing list and other sites I have learned
> about PrintSafepointStatistics, PrintGCApplicationStoppedTime &
> PrintGCApplicationConcurrentTime. But I haven't found any detailed
> explanation of what these values are. Potentially I could get the
> customer to re-run tests with these, but as I don't know exactly what
> the output means, I can't say for sure how I would use the information
> gained from it.

There's a header line that describes the values (the printing was less
than friendly in earlier versions; that was fixed in a more recent
version). For an explanation of any fields that don't make sense, check
on the hotspot-runtime-dev list.
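For example, re-running with something along these lines added to the
java command would give you the per-GC user/sys/real breakdown as well
as the safepoint and stopped-time numbers (the -Xmx/-Xmn values below
are only placeholders; keep whatever the customer really runs with):

  java -Xmx12g -Xmn512m \
       -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -XX:+PrintTenuringDistribution \
       -XX:+PrintGCApplicationStoppedTime \
       -XX:+PrintGCApplicationConcurrentTime \
       -XX:+PrintSafepointStatistics \
       ... <the usual application flags and main class>

The "[Times: user=... sys=... real=...]" tail of each ParNew record in
the PrintGCDetails output is what you'd use for the
elapsed-vs-(user+system) ratio I mentioned above: on a multi-core box,
user+sys should be several times real if the scavenge is parallelizing
well.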
>
> The customer site is running an old jdk 1.6.0_14, with
> -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC. It uses a 12 GB heap and
> a relatively small 512 MB new size.
>
> If I'm completely wrong in the above guesses about how things work, I
> will suggest to them to increase the NewSize, as it seems our new
> version of the software fills up newspace too quickly and short-lived
> objects are promoted into the CMS heap too soon. But if the problem
> would be

That could be part of the reason, but my guess is that if the LBQ issue
is the problem, this will not help reduce scavenge times; on the
contrary, it may increase scavenge times because longer-surviving
objects are copied more times between survivor spaces before being
tenured.

> that mutator threads don't enter the safepoint as quickly as with our
> old software, then their long young gc pauses would remain. I really
> want them to upgrade to a more recent jdk as I know 1.6.0_18 fixed
> some CMS fragmentation bugs.

-- ramki

From y.s.ramakrishna at oracle.com  Tue Jul 12 21:56:37 2011
From: y.s.ramakrishna at oracle.com (Y. Srinivas Ramakrishna)
Date: Tue, 12 Jul 2011 21:56:37 -0700
Subject: Young GC pause time definitions
In-Reply-To:
References: <4E1C8420.1080305@oracle.com>
Message-ID: <4E1D2585.9060906@oracle.com>

On 7/12/2011 9:45 PM, Krystal Mok wrote:
> On Wed, Jul 13, 2011 at 1:28 AM, Y. Srinivas Ramakrishna <
> y.s.ramakrishna at oracle.com> wrote:
>>
>>>
>>> The biggest change in our software is that we have reimplemented a big
>>> part of our main execution to rely on blocking queues (from java util
>>> concurrent) instead of an in-house queue implementation based on java
>>> monitors. I've had some wild guesses that synchronization on blocking
>>
>> Note the problem with LinkedBlockingQueues (I'll dig up the CR id) that
>> caused GC behaviour similar to what you describe. Fixed in a later JDK
>> (I'll dig up the details for you).
>
>
> Is it this one: http://bugs.sun.com/view_bug.do?bug_id=6805775
> Bug ID 6805775: LinkedBlockingQueue Nodes should unlink themselves before
> becoming garbage

Yes, thanks Kris, that's the one I had in mind. Given that it's
indicated as fixed in 6u19 and Christian is on 6u14, it's worth trying
a JDK >= 6u19 and preferably the latest (6u26 or 6u27?).

-- ramki

From rednaxelafx at gmail.com  Tue Jul 12 21:45:35 2011
From: rednaxelafx at gmail.com (Krystal Mok)
Date: Wed, 13 Jul 2011 12:45:35 +0800
Subject: Young GC pause time definitions
In-Reply-To: <4E1C8420.1080305@oracle.com>
References: <4E1C8420.1080305@oracle.com>
Message-ID:

On Wed, Jul 13, 2011 at 1:28 AM, Y. Srinivas Ramakrishna <
y.s.ramakrishna at oracle.com> wrote:

>
>
> > The biggest change in our software is that we have reimplemented a big
> > part of our main execution to rely on blocking queues (from java util
> > concurrent) instead of an in-house queue implementation based on java
> > monitors. I've had some wild guesses that synchronization on blocking
>
> Note the problem with LinkedBlockingQueues (I'll dig up the CR id) that
> caused GC behaviour similar to what you describe. Fixed in a later JDK
> (I'll dig up the details for you).

Is it this one: http://bugs.sun.com/view_bug.do?bug_id=6805775
Bug ID 6805775: LinkedBlockingQueue Nodes should unlink themselves before
becoming garbage

Regards,
Kris Mok
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20110713/28a425e2/attachment.html