The GCLocker strikes again (bad remark / GCLocker interaction)

Srinivas Ramakrishna ysr1729 at gmail.com
Wed Sep 3 19:11:17 UTC 2014


Hi Tony --

The scavenge-before-remark bailing if the gc-locker is active was an
expedient solution, and one I did not devote much thought to, as
gc-locker activity was considered infrequent enough not to affect the
bottom line by much. I can imagine, though, that with very frequent
gc-locker activity and extremely large Edens this can become an issue. The
fact that the scavenge might bail was anticipated, as some of the
comments in that section of code indicate. A ticklish dilemma here is
whether the CMS thread should wait for the JNI critical sections to clear
or just plough on, as is the case today. The thinking was that it's better
to have a longish remark pause because Eden was not emptied than to delay
the CMS collection and risk a concurrent mode failure, which would be much
more expensive.

As you alluded to in your email, the issue is a bit tricky because of the
way scavenge-before-remark is currently implemented: CMS decides to do a
remark, stops all the mutators, then decides that it must do a scavenge,
which cannot be done because the gc-locker is held, so we bail from the
scavenge and just do the remark pause (this is safe because no objects
are moved). The whole set-up of CMS's vm-ops was predicated on the
assumption of non-interference with other operations, because these are in
some sense "read-only" with respect to the heap, so we can safely
schedule the safepoint at any time without any worries about moving objects.
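In rough pseudocode, today's behavior amounts to something like the
sketch below. The names (GCLockerModel, scavenge_before_remark) are
invented for illustration and are not actual HotSpot identifiers:

```cpp
// Simplified model of the current scavenge-before-remark behavior.
// Names are illustrative only, not real HotSpot code.
struct GCLockerModel {
  bool active = false;   // true while some thread is in a JNI critical section
};

// Returns true if the scavenge ran, false if it bailed.
bool scavenge_before_remark(const GCLockerModel& locker, bool& remark_done) {
  bool scavenged = false;
  if (!locker.active) {
    scavenged = true;    // Eden emptied before the remark
  }
  // The remark proceeds either way: it moves no objects, so it is
  // safe even while the gc-locker is held -- at the cost of scanning
  // a non-empty Eden during the pause.
  remark_done = true;
  return scavenged;
}
```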

Scavenge-before-remark is the only wrinkle in this otherwise flat and
smooth landscape.

I suspect the correct way to deal with this once and for all in a uniform
manner might be to have vm-ops that need a vacant gc-locker
be enqueued on a separate vm-ops queue whose operations are executed as
soon as the gc-locker has been vacated (today this would likely
be all the vm-ops other than perhaps a handful of CMS vm-ops). But
this would be a fairly intrusive and delicate rewrite of the
vm-op and gc-locker subsystems.
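As a sketch of that idea (with invented names and none of the real vm-op
machinery -- this is not how HotSpot's VMThread queue actually works):

```cpp
#include <functional>
#include <queue>

// Toy model of a deferred vm-op queue: ops that require a vacant
// gc-locker are parked and drained when the last critical section exits.
struct VMOpQueues {
  int gc_locker_depth = 0;
  std::queue<std::function<void()>> deferred;  // ops needing a vacant gc-locker

  void submit(std::function<void()> op, bool needs_vacant_locker) {
    if (needs_vacant_locker && gc_locker_depth > 0) {
      deferred.push(std::move(op));  // run later, once the locker clears
    } else {
      op();                          // safe to run at any safepoint
    }
  }

  void gc_locker_enter() { ++gc_locker_depth; }

  void gc_locker_exit() {
    if (--gc_locker_depth == 0) {
      while (!deferred.empty()) {    // locker vacated: drain pending ops
        deferred.front()();
        deferred.pop();
      }
    }
  }
};
```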

A quicker point solution might be to split the scavenge-and-remark vm-op
into two separate vm-ops -- one that does a (guaranteed) scavenge,
followed by another that does the remark -- each in its own safepoint.
One way to do this might be for the CMS thread to take the JNI critical
lock, set needs_gc() if the gc-locker is active, and then wait on the JNI
critical lock for it to be cleared (which it will be by the last thread
exiting a JNI critical section), which would initiate the scavenge. If the
gc-locker isn't active, the scavenge can be initiated straightaway by the
CMS thread, in the same way that a JNI thread would have initiated it when
it was the last one exiting a JNI critical section. Once the scavenge has
happened, the CMS thread can then do the remark in the normal way. Some
allocation would have happened in Eden between the scavenge and the remark
that follows, but hopefully that would be sufficiently small as not to
affect the performance of the remark. The delicate part here is the
synchronization between the gc-locker state, the CMS thread initiating the
vm-op for the scavenge/remark, and the JNI threads; but this protocol would
be identical to the existing one, except that the CMS thread would now be a
participant in that protocol, which it never was before (this might
call for some scaffolding in the CMS thread so it can participate).
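Roughly, the CMS thread's side of that protocol might look like the
following sketch, where std::mutex/std::condition_variable stand in for
the real Mutex/Monitor machinery and all names are invented (in
particular, the real protocol has the last exiting JNI thread initiate
the GC, which this sketch simplifies to waking the waiting CMS thread):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative sketch only; not actual HotSpot identifiers.
struct JniCriticalState {
  std::mutex lock;
  std::condition_variable cleared;
  int depth = 0;          // threads currently inside a JNI critical section
  bool needs_gc = false;  // a GC was requested while the locker was held
};

void enter_critical(JniCriticalState& s) {
  std::lock_guard<std::mutex> g(s.lock);
  ++s.depth;
}

void exit_critical(JniCriticalState& s) {
  std::lock_guard<std::mutex> g(s.lock);
  if (--s.depth == 0) {
    s.needs_gc = false;
    s.cleared.notify_all();  // last thread out wakes the waiting CMS thread
  }
}

// CMS thread: wait (if necessary) for the gc-locker to clear, then run
// the guaranteed scavenge and the remark as two separate safepoints.
void cms_scavenge_then_remark(JniCriticalState& s,
                              bool& scavenged, bool& remarked) {
  {
    std::unique_lock<std::mutex> g(s.lock);
    if (s.depth > 0) {
      s.needs_gc = true;                           // set needs_gc() and wait
      s.cleared.wait(g, [&] { return s.depth == 0; });
    }
  }
  scavenged = true;  // guaranteed scavenge, in its own safepoint
  remarked = true;   // separate remark safepoint afterwards
}
```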

All this having been said, I am slightly surprised that remark pauses for
large Edens are so poor. I would normally expect pointers from young to
old to be relatively few, and with Eden being scanned multi-threaded
(at sampled "top" boundaries -- perhaps this should use TLAB
boundaries instead), we would be able to scale OK to larger Edens. Have you
looked at the distribution of Eden scanning times during the
remark pause? Does Eden scanning dominate the remark cost? (I was also
wondering whether it might be possible to avoid whatever was causing such
frequent gc-locker activity, as a temporary workaround until the issue
with CMS is fixed?)
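For concreteness, a toy model of that parallel Eden scan: an int array
stands in for Eden, the sampled allocation tops carve it into
per-worker chunks, and counting non-zero words stands in for finding
young-to-old pointers. All names are invented for illustration:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Scan Eden in parallel, one worker per chunk delimited by the sampled
// "top" boundaries; the last boundary must equal eden.size().
long parallel_eden_scan(const std::vector<int>& eden,
                        const std::vector<std::size_t>& boundaries) {
  std::atomic<long> found{0};
  std::vector<std::thread> workers;
  std::size_t start = 0;
  for (std::size_t end : boundaries) {   // one chunk per sampled boundary
    workers.emplace_back([&, start, end] {
      long local = 0;
      for (std::size_t i = start; i < end; ++i)
        if (eden[i] != 0) local++;       // stand-in for pointer checks
      found += local;
    });
    start = end;
  }
  for (auto& w : workers) w.join();
  return found.load();
}
```

If the sampled boundaries are badly skewed (e.g. one huge chunk), one
worker ends up with most of the scanning, which is one reason finer-grained
TLAB boundaries might balance better.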

-- ramki



On Tue, Sep 2, 2014 at 3:21 PM, Tony Printezis <tprintezis at twitter.com>
wrote:

> Hi there,
>
> In addition to the GCLocker issue I've already mentioned (JDK-8048556:
> unnecessary young GCs due to the GCLocker) we're also hitting a second one,
> which in some cases is more severe.
>
> We use quite large edens and we run with -XX:+CMSScavengeBeforeRemark to
> empty the eden before each remark to keep remark times reasonable. It turns
> out that when the remark pause is scheduled it doesn't try to synchronize
> with the GCLocker at all. The result is that, quite often, the scavenge
> before remark aborts because the GCLocker is active. This leads to
> substantially longer remarks.
>
> A side-effect of this is that the remark pause with the aborted scavenge
> is immediately followed by a GCLocker-initiated GC (with the eden being
> half empty). The aborted scavenge checks whether the GCLocker is active
> with check_active_before_gc() which tells the GCLocker to do a young GC if
> it's active. And the young GC is done without waiting for the eden to fill
> up.
>
> The issue is very easy to reproduce with a test similar to what I posted
> on JDK-8048556 (force concurrent cycles by adding a thread that calls
> System.gc() every say 10 secs and set -XX:+ExplicitGCInvokesConcurrent).
> I can reproduce this with the current hotspot-gc repo.
>
> We were wondering whether this is a known issue and whether someone is
> working on it. FWIW, the fix could be a bit tricky.
>
> Thanks,
>
> Tony
>
> --
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>
> @TonyPrintezis
> tprintezis at twitter.com
>
>
