CMS initial mark pauses

Adam Hawthorne adamh at basis.com
Fri Oct 15 18:49:44 UTC 2010


On Thu, Oct 14, 2010 at 19:00, Y. S. Ramakrishna
<y.s.ramakrishna at oracle.com> wrote:

>
> Hi Adam --
>
> ...
>
>
>> I understood before that the initial mark is not done in parallel.  I'm
>> curious - why not?
>>
>
> When it was first implemented, CMS did all its work single-threaded
> over a serial scavenger. It was incrementally parallelized over time,
> but because initial-mark pauses were usually not a concern (small edens,
> small survivor spaces, initial mark immediately following a scavenge)
> it never rose high enough in priority to parallelize. Clearly we have
> reached a point where the old assumptions no longer hold and it's
> time to parallelize it. Or better still, move to G1, which is fully
> parallel and concurrent, and has other advantages as well.
>
>
Thanks for the history lesson!  We did mention G1 to our customer yesterday,
but I'm not yet familiar enough with its tuning knobs to feel confident
suggesting it for a production system.  We've only done minimal testing
in-house, and not yet on the scale of this customer.

More generally, for ParGC and CMS, our heuristic has been to set the heap
size, configure the new size, and then, if necessary, configure survivor
spaces and maybe some other knobs to fulfill our customer requirements.  I
don't know what the equivalent settings are for G1.  I'm curious whether
there's a similar "recipe" for getting it configured and tuned.  When we
tried earlier, we didn't have much success with it.  Can anyone who's spent
significant time tuning it relate their experiences?  Is it worth trying on
2-4 core systems with 1-4 GB of RAM?
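
For reference, the CMS recipe I described above looks roughly like this,
with illustrative values only (not our actual customer settings):

    java -Xms4g -Xmx4g \
         -XX:NewSize=512m -XX:MaxNewSize=512m -XX:SurvivorRatio=6 \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC ...

and my naive guess at a G1 starting point, based only on the documented
flags (G1 is still experimental in the current JDK 6 update releases, so it
needs the unlock flag), would be something like:

    java -Xms4g -Xmx4g \
         -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC \
         -XX:MaxGCPauseMillis=100 ...

but I don't know whether a pause-time goal is even the right way to frame
it, hence the question.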



>
>
>> I have CMSInitiatingOccupancyFraction=50 because I was concerned about
>> some finalization issues in our application, and I thought I remembered
>> that reference processing wasn't done in young GCs.  After enabling
>> PrintReferenceGC, the logs imply that ParNew GCs also clear references -
>> is that true?  If so, it may not be necessary for us to include that
>> option anyway.
>>
>
> Yes, scavenges do process unreachable Reference objects found in the
> young gen. However, once these get into the old gen, you are right that
> you will need a CMS cycle to identify them as unreachable and to process
> them appropriately.


Thanks for the confirmation.
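
For anyone following along, the flags involved here, as I understand them
(corrections welcome), are:

    -XX:+PrintReferenceGC        logs Reference processing in each GC
    -XX:+ParallelRefProcEnabled  multi-threaded reference processing;
                                 off by default, and I haven't yet
                                 measured its effect on our workload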


>>        (1) use no survivor spaces (at the risk of larger scavenge
>>            pauses, larger remark pauses, even concurrent mode failures)
>>        (2) use a sufficiently large heap so as to be able to afford to
>>            set a mark initiation threshold above the low water-mark
>>            (after a major collection cycle). This will keep init-marks
>>            riding on the coat-tails of scavenges.
>>
>> The customer's application appears to fit neatly in a 2.4G heap, and we
>> have -Xmx4g, so I believe we might be able to apply (2) here.  Is (1) above
>> required along with (2), or do these workarounds address the problem
>> independently?  I ask because (a) this customer is already concerned about
>> pause times, so I don't have a lot of room to increase remark and scavenge
>> times, and (b) I'm concerned about eliminating survivor spaces since we've
>> dealt with significant heap fragmentation in the past.
>>
>
> Precisely. The two are actually additive, but either by itself may not
> be sufficient, and as you pointed out, (1) may not always be feasible.


I reduced the survivor spaces in my recommendation for today, but did not
completely eliminate them, and increased the old gen size.  Unfortunately,
the customer made a mistake in the settings that disabled
-XX:+PrintGCDetails, so they failed to get new logs.  They reported that
their user experience was slightly worse, but without logs, I can't
determine whether the GCs are the problem or something else.
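
Concretely, the change was along these lines (values are illustrative, not
the customer's exact command line, which I don't have in front of me):

    before:  -Xmx4g -XX:+UseConcMarkSweepGC \
             -XX:CMSInitiatingOccupancyFraction=50 \
             -XX:NewSize=800m -XX:SurvivorRatio=8

    after:   -Xmx4g -XX:+UseConcMarkSweepGC \
             -XX:CMSInitiatingOccupancyFraction=50 \
             -XX:NewSize=512m -XX:SurvivorRatio=16

i.e. a larger SurvivorRatio to shrink the survivor spaces, and a smaller
new gen so that, with -Xmx fixed, the old gen grows.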


> One other data point is that we have a large number of mostly idle threads
>> (3826 at one count), with most of the idle threads holding onto
>> approximately 2MB of object data.  I don't know if that would significantly
>> contribute to the initial mark pause, but my intuition is that it would
>> increase the time if some of that time is spent marking the stack locals.
>>
>
> Yes, that could be, but probably less significant than a large Eden or
> survivor space: when the CMS initial-mark pauses come immediately after
> a scavenge, the pauses are much shorter, so the larger contribution is
> from the large Eden. If you pour your GC logs into GCHisto, you
> should probably see that the CMS initial-mark pauses increase as
> the most recent scavenge becomes more distant (or you could plot that
> via a spreadsheet and note the relationship).
>

OK, I checked it in GCHisto and you were exactly right; the relationship
was immediately obvious.
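
(For anyone wanting to eyeball this without GCHisto: assuming output from
-XX:+PrintGCDetails and -XX:+PrintGCTimeStamps, something like

    grep -E 'ParNew|CMS-initial-mark' gc.log

lines up the scavenges and initial marks with their timestamps, and the
correlation between initial-mark pause time and the gap since the last
ParNew is visible by inspection.  The exact message text varies by JDK
build.)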

Thanks for your help again.


> -- ramki
>
>
>>
>>
>>
>>        Also, if using iCMS (Incremental CMS), drop the incremental
>>        mode and revert to vanilla CMS.
>>        *** (#2 of 2): 2010-04-14 11:02:03 PDT xxxx at oracle.com
>>
>>
>>
>>    If you have support, you can try escalating it via your support
>> channels
>>    to get this addressed, especially if the workaround/retuning doesn't
>>    do the job.
>>
>>    -- ramki
>>
>>
>> My option seems to be to eliminate CMSInitiatingOccupancyFraction=50
>> and keep the -Xmx4g.  Would it be prudent to set -Xms4g also?
>>
>> And here is the log excerpt from a steady state in the application.  The
>> sigma on pause times for young GC and remark is 17ms and 26ms
>> respectively - they're like clockwork.  The sigma for initial mark is
>> higher, 334ms, due to the large-valued outliers.
>>
>>
>>
> ...
>