CMSWaitDuration unstable behavior

Wed Aug 8 18:56:53 UTC 2012

Hi Michal --

There's an RFE (lost in the mists of time) to piggyback initial marking
work on a ParNew collection (and hence do it multi-threaded
rather than single-threaded as is the case currently). But it never got
implemented, unfortunately.

That said, I understand your motivation is to reduce the duration of the
initial mark pause in the face of using a large Eden space
which is currently marked single-threaded.

Unfortunately, CMSWaitDuration was never meant to control the scheduling of
the initial mark pause in relation to the scavenge.
Rather, it was meant to be the maximum time for which the CMS collector
would wait for a scavenge to occur. If a scavenge did not occur within that
time, CMS might decide to go ahead anyway and take the action it would
otherwise have taken immediately after the scavenge: polling the old
generation occupancy to decide whether a new CMS cycle should start (hence
initiating a new initial-mark pause), or exiting the "abortable preclean
phase" in the absence of a scavenge occurring suitably soon.
Thus, trying to retrofit CMSWaitDuration to meet your purpose of
co-scheduling initial mark after scavenge is probably not the right thing
to do.
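
To make the "maximum wait" meaning concrete, here is a minimal sketch in
standard C++ (not HotSpot code) of a wait that is bounded by a maximum
duration and ends early only when a scavenge has actually been recorded.
ScavengeMonitor, record_scavenge and wait_for_scavenge are names made up
for illustration. Rechecking a collection counter after each wakeup
mirrors what Michal describes doing with total_collections(), so that
unrelated notifications do not end the wait.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the lock/counter pair; not a HotSpot type.
struct ScavengeMonitor {
    std::mutex lock;
    std::condition_variable cv;
    unsigned total_collections = 0;  // bumped once per completed scavenge

    // Called by the (imagined) young collector when a scavenge finishes.
    void record_scavenge() {
        {
            std::lock_guard<std::mutex> guard(lock);
            ++total_collections;
        }
        cv.notify_all();
    }

    // Reads the counter under the lock so callers can take a baseline.
    unsigned collections() {
        std::lock_guard<std::mutex> guard(lock);
        return total_collections;
    }
};

// Waits until the counter differs from `baseline` or `max_wait` elapses,
// whichever comes first. Returns true if a scavenge was observed.
// Wakeups that leave the counter unchanged simply resume the wait.
inline bool wait_for_scavenge(ScavengeMonitor& mon, unsigned baseline,
                              std::chrono::milliseconds max_wait) {
    assert(max_wait.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + max_wait;
    std::unique_lock<std::mutex> guard(mon.lock);
    return mon.cv.wait_until(guard, deadline, [&] {
        return mon.total_collections != baseline;
    });
}
```

A false return corresponds to the timeout case, where the collector would
go ahead on its own rather than keep waiting for a scavenge.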

I think the easiest thing to do, as Jon suggested, is to have an explicit
flag such as CMSScavengeBeforeInitialMark which would be
analogous to the current role of CMSScavengeBeforeRemark. Here, ICMS wakes
up at the normal time and takes control, but
instead of doing an initial mark straightaway, it first initiates a parallel
scavenge and follows that up with a single-threaded initial mark.
Granted this will not cause an initial mark step to occur immediately after
a "normal" scavenge as we really want, but rather cause
an additional scavenge to happen just before an initial mark pause is
scheduled in ICMS (exactly as is the case with the current
CMSScavengeBeforeRemark where an extra scavenge occurs which is not very
pleasant), but it would be far easier to implement
without making any other changes in the system.
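
The ordering described above can be sketched as follows. This is only an
illustration in plain C++: CMSScavengeBeforeInitialMark is the proposed
flag, not an existing one, and Collector, Flags, scavenge(), initial_mark()
and start_cms_cycle() are placeholder names rather than real HotSpot entry
points.

```cpp
#include <string>
#include <vector>

// Placeholder collector that just records which steps ran, in order.
struct Collector {
    std::vector<std::string> log;
    void scavenge()     { log.push_back("scavenge"); }      // parallel young collection
    void initial_mark() { log.push_back("initial-mark"); }  // single-threaded pause
};

// Hypothetical flag, analogous to CMSScavengeBeforeRemark.
struct Flags {
    bool CMSScavengeBeforeInitialMark = false;
};

// Starts a CMS cycle: optionally empty the young gen first so that the
// initial mark has little young-gen content left to scan.
inline void start_cms_cycle(Collector& c, const Flags& flags) {
    if (flags.CMSScavengeBeforeInitialMark) {
        c.scavenge();
    }
    c.initial_mark();
}
```

With the flag set, every cycle pays for one extra scavenge, which is the
cost noted above.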

The best solution of course is to implement the RFE to do the initial mark
in parallel piggybacked on the scavenge and all
your problems go away (ICMS may need a very minor adjustment for that).
Anyone want to take a stab at parallelizing
and piggybacking initial mark on scavenge? It would be a matter of
extending the scavenge object and root scanning closures
to new closures so as to not skip the references that point outside of
young gen as is done for the normal ParNew scanning closures,
but to mark the appropriate bits in the CMS marking bit map. That's really
theoretically all it will take.
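
As a toy illustration of that closure idea (standard C++, not the actual
ParNew closures): the base closure below ignores references that point
outside the young gen, while the derived one additionally sets a bit for
them in a mark bitmap. The address layout, YoungGen, MarkBitMap and both
closure names are invented for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Toy heap layout: addresses [0, young_end) are young, the rest are old.
struct YoungGen {
    std::size_t young_end;
    bool contains(std::size_t addr) const { return addr < young_end; }
};

// Toy marking bitmap covering the whole address range, one bit per address.
struct MarkBitMap {
    std::vector<bool> bits;
    explicit MarkBitMap(std::size_t heap_size) : bits(heap_size, false) {}
    void mark(std::size_t addr) { bits[addr] = true; }
    bool is_marked(std::size_t addr) const { return bits[addr]; }
};

// Scavenge-style closure: processes young references, skips the others.
struct ScanClosure {
    const YoungGen& young;
    std::vector<std::size_t> scavenged;  // young referents that were visited
    explicit ScanClosure(const YoungGen& y) : young(y) {}
    virtual ~ScanClosure() = default;

    void do_ref(std::size_t addr) {
        if (young.contains(addr)) {
            scavenged.push_back(addr);   // stands in for copy/forward work
        } else {
            do_non_young(addr);
        }
    }

protected:
    virtual void do_non_young(std::size_t) {}  // normal scavenge: skip
};

// Extended closure: same young-gen work, but old-gen referents get marked.
struct MarkingScanClosure : ScanClosure {
    MarkBitMap& bitmap;
    MarkingScanClosure(const YoungGen& y, MarkBitMap& bm)
        : ScanClosure(y), bitmap(bm) {}

protected:
    void do_non_young(std::size_t addr) override { bitmap.mark(addr); }
};
```

The only difference between the two closures is the handling of
references that leave the young gen, which is the extension point
described above.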

PS: Jon, if Michal takes the approach of CMSScavengeBeforeInitialMark, I'd
say it would be useful to the broader community (not
just ICMS users) if that were integrated into the main-line code, as it
would be a via-media for CMS scaling in the absence of the
piggybacking RFE which is really the best solution here.

thanks!
-- ramki

On Wed, Aug 8, 2012 at 8:11 AM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:

> Michal,
>
> The engineer with the most experience on CMS left Oracle
> and I suspect this is not going to get fixed in the way you want.
>
> I've created CR 7189971 to capture your comments and it will be
> reviewed along with other RFEs for CMS but I would not be
> optimistic.
>
> Since you are customizing your own VM, did you consider
> explicitly invoking a young collection before the initial mark
> the way that it is done for the remark phase with the flag
>
> CMSScavengeBeforeRemark?
>
> Jon
>
>
> On 8/7/2012 6:16 AM, Frajt, Michal wrote:
>
>> Hi all,
>>
>> We have been using the incremental CMS collector for many years. We have a
>> distributed application framework based on the subscribe-unsubscribe model
>> where data unsubscriptions are handled by the application layer simply
>> dropping its strong reference to the distributed data. The underlying
>> application framework layer uses weak references to track the data
>> requirements of the application layer. We keep the old generation
>> processed permanently (incrementally) so that the weak references are
>> released and reported within a short period of time (minutes).
>>
>> Unfortunately the incremental mode lacks support for
>> CMSWaitDuration to place the initial mark phase right after the young space
>> collection. After some new gen sizing optimization we reached a situation
>> where the new gen is more or less big enough to hold most of the live
>> objects, with only a few promotions to the old gen. The incremental CMS is
>> then started every minute at a random moment, with the new gen largely
>> full of garbage. The initial mark takes 20-50 times longer than a single
>> new gen collection (40ms new gen, initial mark 1100ms).
>>
>> We decided to customize OpenJDK 6 by adding CMSWaitDuration support to
>> the incremental mode. We took the same approach as the wait_on_cms_lock
>> method does with the CGC_lock object. Unfortunately we realized that the
>> CGC_lock mutex is also notified in situations other than the
>> young space collection finishing. These unrelated notifications come
>> from the desynchronize method invocations, and they cause
>> wait_on_cms_lock to return earlier than required. The initial mark phase
>> is then started before the young space collection even though the
>> specified wait duration has not yet elapsed. We fixed it by waiting
>> again if the GenCollectedHeap::heap()->total_collections()
>> counter has not changed after the CGC_lock->wait method returns, but for no
>> longer than CMSWaitDuration in total. The initial mark is then always
>> placed (if CMSWaitDuration is long enough) after the young space
>> collection. Every initial mark phase now takes no longer than 17ms
>> (previously 1100ms).
>>
>> We tested the CMSWaitDuration behavior in the normal CMS mode. We
>> specified -XX:+UseCMSInitiatingOccupancyOnly and
>> -XX:CMSInitiatingOccupancyFraction=10 to force the CMS to run permanently
>> (shouldConcurrentCollect should be returning true). The CMS initial-mark is
>> often started without waiting for the young space collection, which
>> makes the initial marking run 20-50 times longer. We regard this as unstable
>> behavior of the CMSWaitDuration implementation, related to the problem of
>> the wait-notify signaling on the CGC_lock object. We disabled explicit
>> GC invocation (-XX:+DisableExplicitGC) to be sure there is no other reason
>> to start the CMS initial mark phase before the young space collection.
>>
>> Is there any plan to get the CMSWaitDuration supported in the incremental
>> mode and/or get it fixed in the normal mode?
>>
>> Thanks,
>> Michal Frajt
>>
>>
>>
>