CMSWaitDuration unstable behavior
Frajt, Michal
Michal.Frajt at partner.commerzbank.com
Tue Aug 7 13:16:57 UTC 2012
Hi all,
We have been using the incremental CMS collector for many years. We have a distributed application framework based on a subscribe-unsubscribe model, where the application layer handles data unsubscription simply by dropping its strong reference to the distributed data. The underlying framework layer uses weak references to track which data the application layer still requires. We keep the old generation under permanent (incremental) collection so that the weak references are cleared and reported within a short period of time (minutes).
Unfortunately, the incremental mode lacks support for CMSWaitDuration, which would place the initial mark phase right after a young space collection. After some new gen sizing optimization we ended up in a situation where the new gen is more or less big enough to hold most of the live objects, with only a few promotions to the old gen. The incremental CMS cycle is then started every minute at a random moment, with the new gen fairly full of garbage. The initial mark takes 20-50 times longer than a single new gen collection (40ms new gen, initial mark 1100ms).
We decided to customize OpenJDK 6 by adding CMSWaitDuration support to the incremental mode. We took the same approach as the wait_on_cms_lock method, which waits on the CGC_lock object. Unfortunately, we found that the CGC_lock mutex is also notified in situations other than the end of a young space collection. These unrelated notifications come from invocations of the desynchronize method, and they cause wait_on_cms_lock to return earlier than required. The initial mark phase is then started before the young space collection, even though a sufficiently long wait duration is specified. We fixed this by waiting again if the GenCollectedHeap::heap()->total_collections() counter has not changed after the CGC_lock->wait method returns, but for no longer than CMSWaitDuration in total. The initial mark is then always placed (if CMSWaitDuration is long enough) after the young space collection. Every initial mark phase now takes no longer than 17ms (previously 1100ms).
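The re-wait logic described above can be illustrated with a small model. This is only a sketch in Python, not the HotSpot patch itself: the class CollectionSignal and its methods are hypothetical names, a threading.Condition stands in for the CGC_lock monitor, and an integer counter stands in for total_collections(). The waiter records the counter, then keeps waiting after any wake-up that did not change it, bounded by one overall deadline.

```python
import threading
import time


class CollectionSignal:
    """Toy stand-in (hypothetical) for the CGC_lock monitor plus a collection counter."""

    def __init__(self):
        self._cond = threading.Condition()
        self._total_collections = 0

    def total_collections(self):
        with self._cond:
            return self._total_collections

    def young_collection_finished(self):
        # Models the notification sent when a young space collection completes.
        with self._cond:
            self._total_collections += 1
            self._cond.notify_all()

    def unrelated_notify(self):
        # Models a notification on the same monitor that has nothing to do
        # with a young collection (the counter is left unchanged).
        with self._cond:
            self._cond.notify_all()

    def wait_for_young_collection(self, max_wait_s):
        """Wait until the counter changes, but no longer than max_wait_s in total.

        Returns True if a collection was observed, False on timeout.
        """
        deadline = time.monotonic() + max_wait_s
        with self._cond:
            start_count = self._total_collections
            while self._total_collections == start_count:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # wait budget exhausted, proceed anyway
                # May return early on an unrelated notify; the loop re-checks
                # the counter and waits again for the remaining time only.
                self._cond.wait(remaining)
            return True
```

A caller would start the initial mark once wait_for_young_collection returns, whether it returned True (a young collection just finished) or False (the wait budget ran out). Using a single deadline rather than restarting the timeout on every wake-up is what keeps the total wait within the configured duration.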
We also tested the CMSWaitDuration behavior in the normal CMS mode. We specified -XX:+UseCMSInitiatingOccupancyOnly and -XX:CMSInitiatingOccupancyFraction=10 to force CMS to run permanently (shouldConcurrentCollect should then always return true). The CMS initial mark is frequently started without waiting for the young space collection, which makes the initial marking run 20-50 times longer. We consider this unstable behavior of the CMSWaitDuration implementation, related to the problem with the wait-notify signaling on the CGC_lock object. We disabled explicit GC invocation (-XX:+DisableExplicitGC) to be sure there is no other reason for the CMS initial mark phase to start before the young space collection.
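For anyone wanting to reproduce the normal-mode test, the flag set could be assembled roughly as follows. This is a sketch under assumptions: the helper names cms_test_flags and java_command are hypothetical, -XX:+UseConcMarkSweepGC is assumed to be how CMS was enabled, and the CMSWaitDuration value (in milliseconds) is a placeholder, since the value used in the test is not stated above.

```python
def cms_test_flags(wait_duration_ms, incremental=False):
    """Build the JVM flags for the test described above.

    wait_duration_ms is an assumed placeholder; the original value is not given.
    """
    flags = [
        "-XX:+UseConcMarkSweepGC",                # assumption: CMS enabled this way
        "-XX:+UseCMSInitiatingOccupancyOnly",     # from the post
        "-XX:CMSInitiatingOccupancyFraction=10",  # from the post
        "-XX:CMSWaitDuration=%d" % wait_duration_ms,
        "-XX:+DisableExplicitGC",                 # from the post
    ]
    if incremental:
        # Incremental CMS mode, as used in the production setup.
        flags.append("-XX:+CMSIncrementalMode")
    return flags


def java_command(main_class, wait_duration_ms, incremental=False):
    """Return the full argument list for launching a (hypothetical) test class."""
    return ["java"] + cms_test_flags(wait_duration_ms, incremental) + [main_class]
```

Adding GC logging flags such as -XX:+PrintGCDetails would make it possible to compare initial mark pause times against the preceding young collection in the log.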
Is there any plan to support CMSWaitDuration in the incremental mode and/or to fix it in the normal mode?
Thanks,
Michal Frajt
More information about the hotspot-gc-dev mailing list