CMSWaitDuration unstable behavior

Thu Aug 23 05:47:59 UTC 2012

Hi Michal -- thanks for drawing my attention to your response, which I had
somehow missed in my overflowing mailbox...

On Thu, Aug 9, 2012 at 4:32 AM, Frajt, Michal <
Michal.Frajt at partner.commerzbank.com> wrote:

> Hi Ramki,
>
> The current CMSWaitDuration implementation (in the normal mode only) is
> not always waiting the maximum specified time for a scavenge to occur.
> There is just a simple CGC_lock mutex wait call with the CMSWaitDuration
> parameter but nobody checks how much time passed when the method returns.
> The code expects that the mutex is notified after the scavenge and when the
> CMS collector thread is to terminate. Unfortunately there are, to me
> unknown, other mutex notifications coming from the desynchronize method
> invocation. The CMSWaitDuration is currently only a maximum time for which
> the CMS collector can wait before it checks if a new CMS cycle should
> start. It is not directly related to the scavenge occurrence, even if it
> occasionally happens to coincide with one.
>

Thanks for this observation. I think you are right. AFAIK, the
"desynchronize" occurs at each global safepoint. And you are right that
safepoints can occur for other reasons than GC (for example they used to
occur for some kinds of deoptimization, although I am not sure if that's
the case today, and they do occur for bulk(?) biased lock revocation). So
you are completely correct that the notifications on that wait can occur
for other reasons than just scavenge.
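
To make the failure mode and the kind of fix described in the quoted
message concrete, here is a minimal sketch in Python. It is a toy model, not
HotSpot code: the class and method names (ScavengeWaiter, notify_unrelated,
scavenge_done, wait_for_scavenge) are made up for illustration, and
total_collections merely stands in for the collection counter mentioned in
the original report. The point is only the loop: after every wakeup, re-check
the counter and wait again for whatever remains of the overall budget.

```python
import threading
import time


class ScavengeWaiter:
    """Toy model of a collector thread waiting on a shared lock.

    All names here are hypothetical; this is not the HotSpot implementation.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self.total_collections = 0  # stand-in for the heap's collection counter

    def notify_unrelated(self):
        # Models a notification that is NOT caused by a scavenge,
        # for example one issued at the end of an unrelated safepoint.
        with self._cond:
            self._cond.notify_all()

    def scavenge_done(self):
        # Models the end of a young collection: bump the counter, then notify.
        with self._cond:
            self.total_collections += 1
            self._cond.notify_all()

    def wait_for_scavenge(self, max_wait_s):
        """Wait until a scavenge is observed or max_wait_s has elapsed in total.

        Returns True if the counter changed, False on timeout.
        """
        deadline = time.monotonic() + max_wait_s
        with self._cond:
            before = self.total_collections
            while self.total_collections == before:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                # An unrelated notify wakes us early; the loop re-checks the
                # counter and waits again, but only for the remaining time.
                self._cond.wait(remaining)
            return True
```

A single timed wait without the surrounding loop would return on the first
unrelated notification, which is the behavior reported in this thread.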

Given that this is the case, and given that you have fixed the issue in
your own JVM based on HotSpot, it would be great if
you are able to share your fix to this problem as a patch on this list so
it could be reviewed for inclusion into HotSpot, since the community of CMS
users would also gain from this fix.

Thanks again for finding the problem and fixing it!!
-- ramki

> The abortable preclean phase calls wait_on_cms_lock with the
> CMSAbortablePrecleanWaitMillis timeout just to react to CMS thread
> termination while taking a short break between preclean iterations. The
> CMSWaitDuration time is exclusively specific to the new CMS cycle start
> (hence initiating a new initial-mark pause).
>
> The idea of the CMSScavengeBeforeInitialMark might be easier to implement
> but we would strongly prefer not to invoke yet another scavenge explicitly
> as it unbalances the aging of young objects and leads to unwanted promotions.
> Even if CMSWaitDuration was never meant to control the scheduling of the
> initial mark pause in relation to the scavenge, it would still be better to
> have it behave the way it was never meant to than to have unplanned scavenge
> invocations. One could even think about a combined solution which
> first waits for the specified duration and, if no scavenge
> occurs, explicitly invokes a scavenge before the initial-mark phase
> (or better, pause) starts. Both would be configurable by the existing
> CMSWaitDuration and the new CMSScavengeBeforeInitialMark parameters.
>
> We understand that the most experienced CMS engineer left Oracle. We had
> the chance to speak with him, review our CMS observations, and listen to a
> wonderful presentation of the G1 collector. Since then we have tried every
> year to change to the G1 collector but without much success. Either it
> crashes or runs far slower than the current CMS collector with the parallel
> new generation. Solving or helping both the STW phases of the CMS collector
> might still be beneficial for many clients who are not able to move to the
> G1 collector yet.
>
> Regards,
>
> Michal
>
> *From:* Srinivas Ramakrishna [mailto:ysr1729 at gmail.com]
> *Sent:* Wednesday, 8 August 2012 20:57
> *To:* Jon Masamitsu
> *Cc:* hotspot-gc-dev at openjdk.java.net; Frajt, Michal
> *Subject:* Re: CMSWaitDuration unstable behavior
>
> Hi Michal --
>
> There's an RFE (lost in the mists of time) to piggyback initial marking
> work on a ParNew collection (and hence do it multi-threaded
> rather than single-threaded as is the case currently). But it never got
> implemented, unfortunately.
>
> That said, I understand your motivation is to reduce the duration of the
> initial mark pause in the face of using a large Eden space
> which is currently marked single-threaded.
>
> Unfortunately, CMSWaitDuration was never meant to control the scheduling
> of the initial mark pause in relation to the scavenge.
> Rather it was meant to be a maximum wait time for which the CMS collector
> would wait for a scavenge to occur -- if a scavenge
> did not occur within that time, CMS might decide to unequivocally take the
> action it might otherwise have taken immediately after
> the scavenge -- such as polling the old generation occupancy to decide if
> a new CMS cycle should start (hence initiating a new
> initial-mark pause), or if the "abortable preclean phase" should be exited
> in the absence of a scavenge occurring suitably soon.
> Thus, trying to retrofit CMSWaitDuration to meet your purpose of
> co-scheduling initial mark after scavenge is probably not the right thing
> to do.
>
> I think the easiest thing to do, as Jon suggested, is to have an explicit
> flag such as CMSScavengeBeforeInitialMark which would be
> analogous to the current role of CMSScavengeBeforeRemark. Here, ICMS wakes
> up at the normal time and takes control, but
> instead of doing an initial mark straightaway, it first initiates a
> parallel scavenge and follows that up with a single-threaded initial mark.
> Granted this will not cause an initial mark step to occur immediately
> after a "normal" scavenge as we really want, but rather cause
> an additional scavenge to happen just before an initial mark pause is
> scheduled in ICMS (exactly as is the case with the current
> CMSScavengeBeforeRemark where an extra scavenge occurs which is not very
> pleasant), but it would be far easier to implement
> without making any other changes in the system.
>
> The best solution of course is to implement the RFE to do the initial mark
> in parallel piggybacked on the scavenge and all
> your problems go away (ICMS may need a very minor adjustment for that).
> Anyone want to take a stab at parallelizing
> and piggybacking initial mark on scavenge? It would be a matter of
> extending the scavenge object and root scanning closures
> to new closures so as to not skip the references that point outside of
> young gen as is done for the normal parnew scanning closures,
> but to mark the appropriate bits in the CMS marking bit map. That's really
> theoretically all it will take.
>
> PS: Jon, if Michal takes the approach of CMSScavengeBeforeInitialMark, I'd
> say it would be useful to the broader community (not
> just ICMS users) if that were integrated into the main-line code, as it
> would be a via-media for CMS scaling in the absence of the
> piggybacking RFE which is really the best solution here.
>
> thanks!
> -- ramki
>
> On Wed, Aug 8, 2012 at 8:11 AM, Jon Masamitsu <jon.masamitsu at oracle.com>
> wrote:
>
> Michal,
>
> The engineer with the most experience on CMS left Oracle
> and I suspect this is not going to get fixed in the way you want.
>
> I've created CR 7189971 to capture your comments and it will be
> reviewed along with other RFEs for CMS but I would not be
> optimistic.
>
> Since you are customizing your own VM, did you consider
> explicitly invoking a young collection before the initial mark
> the way that it is done for the remark phase with the flag
>
> CMSScavengeBeforeRemark?
>
> Jon
>
>
>
> On 8/7/2012 6:16 AM, Frajt, Michal wrote:
>
> Hi all,
>
> We have been using the incremental CMS collector for many years. We have a
> distributed application framework based on the subscribe-unsubscribe model
> where the data unsubscriptions are handled by the application layer just
> forgetting the strong reference to the distributed data. The underlying
> application framework layer is using weak references to trace the data
> requirement from the application layer. We keep the old generation
> processed permanently (incrementally) to get the weak references released
> and reported within a short period of time (minutes).
>
> Unfortunately the incremental mode is missing the support for the
> CMSWaitDuration to place the initial mark phase right after the young space
> collection. With some new gen sizing optimization we arrived at a situation
> where the new gen is more or less big enough to keep most of the live
> objects with only a few promotions to the old gen. The incremental CMS is
> then started every minute at a random moment with a new gen full of garbage.
> The initial mark takes 20-50 times longer than a single new gen processing
> (40ms new gen, initial mark 1100ms).
>
> We decided to customize OpenJDK 6 by adding incremental mode
> CMSWaitDuration support. We took the same approach as the wait_on_cms_lock
> method does with the CGC_lock object. Unfortunately we realized that the
> CGC_lock mutex is additionally notified in some situations other than the
> young space collection finishing. The notifications unrelated to the young
> space collection are coming from the desynchronize method invocations. These
> unrelated notifications are causing the wait_on_cms_lock to return earlier
> than required. The initial mark phase is started before the young space
> collection even though the specified wait duration is long enough. We
> have fixed it by waiting again if the
> GenCollectedHeap::heap()->total_collections() counter is not changed after
> the CGC_lock->wait method returns, but not longer than the CMSWaitDuration
> in total. The initial mark is then always placed (if CMSWaitDuration is
> long enough) after the young space collection. Every initial mark phase
> takes no longer than 17ms (previously 1100ms).
>
> We tested the CMSWaitDuration behavior in the normal CMS mode. We
> specified the -XX:+UseCMSInitiatingOccupancyOnly and
> -XX:CMSInitiatingOccupancyFraction=10 to force the CMS running permanently
> (shouldConcurrentCollect should be returning true). The CMS initial-mark is
> many times started without waiting for the young space collection, which
> makes the initial marking run 20-50 times longer. We regard this as unstable
> behavior of the CMSWaitDuration implementation related to the problem of
> the wait-notify signaling on the CGC_lock object. We disabled the explicit
> GC invocation (-XX:+DisableExplicitGC) to be sure there is no other reason
> to start the CMS initial mark phase before the young space collection.
>
> Is there any plan to get the CMSWaitDuration supported in the incremental
> mode and/or get it fixed in the normal mode?
>
> Thanks,
> Michal Frajt
>