The GCLocker strikes again (bad remark / GCLocker interaction)
Tony Printezis
tprintezis at twitter.com
Thu Sep 4 15:55:33 UTC 2014
FYI: I just created:
JDK-8057573: CMSScavengeBeforeRemark ignored if GCLocker is active
JDK-8057586: Explicit GC ignored if GCLocker is active
Tony
On 9/4/14, 11:20 AM, Tony Printezis wrote:
> Hi Ramki,
>
> Thanks for taking the time to reply! To follow-up on a couple of
> points in your e-mail:
>
> "A ticklish dilemma here is whether the CMS thread should wait for the
> JNI CS to clear or just plough on as is the case today. The thinking
> there was that it's better to have a longish remark pause because of
> not emptying Eden than to delay the CMS collection and risk a
> concurrent mode failure which would be much more expensive."
>
> Yeah, it's an interesting trade-off. You're right that holding up the
> remark while waiting for the GCLocker to drain could cause a
> concurrent mode failure. And, yes, the remark could be longer too
> (more cards can potentially be dirtied while we're waiting). On the
> other hand, if the critical sections are properly written (i.e., they
> are bounded / they don't block; and in our case they definitely seem
> to be) the remark will only be delayed by a very short amount of time.
> Of course, as we all know, the above "if" is a really big "IF". :-) But,
> we should not be penalizing correctly behaving code to make sure
> misbehaving code doesn't misbehave even more (read this like: we can
> put the new behavior on a flag).
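>
> To make the "bounded / don't block" point concrete, a well-behaved JNI
> critical section on the native side looks roughly like the following
> (just an illustration, not code from our app):
>
>   #include <jni.h>
>
>   // Between Get/ReleasePrimitiveArrayCritical the GCLocker is held, so
>   // the body must be short and must not allocate, block on locks, do
>   // I/O, or call back into the JVM.
>   extern "C" JNIEXPORT jlong JNICALL
>   Java_Example_sumCritical(JNIEnv* env, jclass, jintArray arr) {
>     jsize len = env->GetArrayLength(arr);
>     jint* p = static_cast<jint*>(env->GetPrimitiveArrayCritical(arr, NULL));
>     if (p == NULL) return 0;  // pinning / copy failed
>     jlong sum = 0;
>     for (jsize i = 0; i < len; i++) sum += p[i];
>     // JNI_ABORT: we only read the array, nothing to copy back
>     env->ReleasePrimitiveArrayCritical(arr, p, JNI_ABORT);
>     return sum;
>   }
>
> As long as every critical section in the app looks like that, waiting
> for the GCLocker to drain before the remark should only cost a few
> microseconds.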
>
> Regarding a concurrent mode failure being more expensive than a longer
> remark: Of course! But, our remarks when a scavenge fails tend to be
> two orders of magnitude longer than when the scavenge succeeds, so quite
> disruptive. From a small test run on my MacBook that deliberately triggers
> this often: with a 3G young gen, normal remarks are 20ms-25ms, whereas
> when the scavenge fails, remarks last over 1 sec.
>
> "As you alluded to in your email, the issue is a bit tricky"
>
> Oh, yes. This is why, unlike JDK-8048556 (the other GCLocker issue), I
> didn't attempt to immediately work on a fix and opted to discuss this
> first on the list.
>
> "I suspect the correct way to deal with this one and for all in a
> uniform manner might be to have vm ops that need a vacant gc-locker to
> be enqueued on a separate vm-ops queue whose operations are executed
> as soon as the gc-locker has been vacated"
>
> I do like the idea of having a way for a VM op to indicate that it
> requires the GCLocker to be inactive when the threads stop. This will
> put all the logic "in one place" (OK, spread out across the GCLocker, VM
> thread, etc., but you know what I mean) and we will be able to re-use
> it wherever we need it, instead of re-inventing the wheel every time
> we need this.
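>
> Just to sketch what I mean (the names and plumbing below are made up
> for the discussion; I haven't checked what's actually available on
> GC_locker or the VM thread):
>
>   // Sketch only -- not the real HotSpot code.
>   // A vm-op advertises whether it needs the GCLocker to be inactive:
>   virtual bool requires_inactive_gc_locker() const { return false; }
>
>   // ...and the scheduling side consults it before running the op:
>   if (op->requires_inactive_gc_locker() && GC_locker::is_active()) {
>     // Don't run the op now; park it on a queue that is drained as
>     // soon as the last thread leaves its JNI critical section.
>     GC_locker::defer_vm_operation(op);   // hypothetical
>   } else {
>     VMThread::execute(op);
>   }
>
> The scavenge-before-remark and the System.gc() vm-ops (see below) could
> then both just override that one method.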
>
> BTW, in other but related news :-), explicit GCs also suffer from this
> problem. So, it's possible for a System.gc() to be completely ignored
> if the GCLocker is active during the safepoint (I can reproduce this
> with both CMS and ParOld). Like remark, the code that schedules the
> System.gc() VM op also doesn't synchronize properly with the GCLocker.
> And, yes, instead of a Full GC this also causes a GCLocker-induced
> young GC with a non-full eden (for the same reason it happens after
> a remark with a failed scavenge, as I described in my original
> e-mail). This is another use for the "this VM op should only be
> scheduled when the GCLocker is inactive" feature.
>
> "All this having been said, I am slightly surprised that remark pauses
> for large Edens are so poor."
>
> They are quite large edens. :-)
>
> "Have you looked at the distribution of Eden scanning times during the
> remark pause? Does Eden scanning dominate the remark cost?"
>
> I assumed it did; the long remarks only happen when the eden is not
> empty. However, I haven't looked at the scanning times. Is there
> existing instrumentation that will tell us that? We can always add
> some ourselves.
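>
> (Something as simple as the following, wrapped around the eden-scanning
> phase of the remark, would probably be enough. Sketch only;
> scan_eden_roots() just stands in for the real phase.)
>
>   elapsedTimer eden_scan_timer;
>   eden_scan_timer.start();
>   scan_eden_roots();                 // placeholder for the actual work
>   eden_scan_timer.stop();
>   if (PrintGCDetails) {
>     gclog_or_tty->print_cr("[eden scan: %.3f ms]",
>                            eden_scan_timer.seconds() * 1000.0);
>   }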
>
> "(I was also wondering if it might be possible to avoid using whatever
> was causing such frequent gc-locker activity as a temporary workaround
> until the issue w/CMS is fixed?)"
>
> We're looking into this.
>
> BTW, I'll create new CRs for the remark / GCLocker interaction and the
> System.gc() / GCLocker interaction to properly track both issues.
>
> Tony
>
> On 9/3/14, 3:11 PM, Srinivas Ramakrishna wrote:
>> Hi Tony --
>>
>> The scavenge-before-remark bailing if the gc-locker is active was an
>> expedient solution and one that I did not expend much thought on,
>> as gc-lockers were considered infrequent enough not to affect the
>> bottom line by much. I can imagine, though, that with very frequent
>> gc-locker activity and extremely large Edens this can be an issue. The
>> fact that the scavenge might bail was already considered as some of the
>> comments in that section of code indicate. A ticklish dilemma here is
>> whether the CMS thread should wait for the JNI CS to clear or just plough
>> on as is the case today. The thinking there was that it's better to
>> have a longish remark pause because of not emptying Eden than to
>> delay the
>> CMS collection and risk a concurrent mode failure which would be much
>> more expensive.
>>
>> As you alluded to in your email, the issue is a bit tricky because of
>> the way scavenge before remark is currently implemented ... CMS
>> decides to
>> do a remark, stops all the mutators, then decides that it must do a
>> scavenge, which now cannot be done because the gc-locker is held, so
>> we bail from
>> the scavenge and just do the remark pause (this is safe because no
>> objects are moved). The whole set-up of CMS' vm-ops was predicated on the
>> assumption of non-interference with other operations because these
>> are in some sense "read-only" wrt the heap, so we can safely
>> schedule the safepoint at any time without any worries about moving
>> objects.
>>
>> Scavenge-before-remark is the only wrinkle in this otherwise flat and
>> smooth landscape.
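>>
>> In heavily simplified pseudo-code, the scavenge-before-remark vm-op
>> currently behaves roughly like this (a paraphrase for the purposes of
>> this discussion, not the actual code):
>>
>>   void ScavengeThenRemark::doit() {
>>     // We are already inside the safepoint; mutators are stopped.
>>     if (CMSScavengeBeforeRemark) {
>>       // check_active_before_gc() returns true if a JNI critical
>>       // section is active, and also asks the GCLocker to trigger a
>>       // young GC as soon as the last critical section exits.
>>       if (!GC_locker::check_active_before_gc()) {
>>         do_scavenge();   // empties eden before the remark
>>       }
>>       // else: bail on the scavenge, remark over a full eden, and a
>>       // GCLocker-induced young GC follows shortly after the pause.
>>     }
>>     do_remark();         // safe either way: nothing is moved
>>   }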
>>
>> I suspect the correct way to deal with this once and for all in a
>> uniform manner might be to have vm ops that need a vacant gc-locker to
>> be enqueued on a separate vm-ops queue whose operations are executed
>> as soon as the gc-locker has been vacated (this would likely
>> be all the vm-ops other than perhaps a handful of CMS vm-ops today).
>> But this would be a fairly intrusive and delicate rewrite of the
>> vm-op and gc-locker subsystems.
>>
>> A quicker point-solution might be to split the scavenge-and-remark
>> vm-op into two separate vm ops -- one that does a (guaranteed) scavenge,
>> followed by another that does a remark -- each in a separate
>> safepoint, i.e. two separate vm-ops. One way to do this might be for
>> the CMS
>> thread to take the jni critical lock, set needs_gc() if the gc locker
>> is active, and then wait on the jni critical lock for it to be
>> cleared (which it
>> will be by the last thread exiting a JNI CS) which would initiate the
>> scavenge. If the gc locker isn't active, the scavenge can be
>> initiated straightaway
>> by the CMS thread in the same way that a JNI thread would have
>> initiated it when it was the last one exiting a JNI CS. Once the
>> scavenge has
>> happened, the CMS thread can then do the remark in the normal way.
>> Some allocation would have happened in Eden between the scavenge and
>> the remark to follow, but hopefully that would be sufficiently small
>> as not to affect the performance of the remark. The delicate part
>> here is the
>> synchronization between gc locker state, the cms thread initiating
>> the vm-op for scavenge/remark and the jni threads, but this protocol
>> would
>> be identical to the existing one, except that the CMS thread would
>> now be a participant in that protocol, which it never was before
>> (this might
>> call for some scaffolding in the CMS thread so it can participate).
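>>
>> Very roughly, and with invented names (I haven't checked these against
>> the actual GC_locker / mutex APIs), the CMS-thread side of that
>> protocol could look like:
>>
>>   void CMSThread::scavenge_then_remark() {
>>     bool scavenged = false;
>>     {
>>       MutexLocker ml(JNICritical_lock);
>>       if (GC_locker::is_active()) {
>>         GC_locker::set_needs_gc();        // last thread out of its JNI CS
>>         while (GC_locker::is_active()) {  //   will trigger the scavenge
>>           JNICritical_lock->wait();       //   and notify us here
>>         }
>>         scavenged = true;
>>       }
>>     }
>>     if (!scavenged) {
>>       // GCLocker was clear: trigger the scavenge ourselves, in its own
>>       // safepoint (vm-op #1).
>>       VMThread::execute(new VM_ScavengeOnly());
>>     }
>>     // Eden is now (nearly) empty, modulo allocation since the scavenge;
>>     // do the remark in a second safepoint (vm-op #2).
>>     VMThread::execute(new VM_RemarkOnly());
>>   }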
>>
>> All this having been said, I am slightly surprised that remark pauses
>> for large Edens are so poor. I would normally expect that pointers
>> from young
>> to old would be quite few and with the Eden being scanned
>> multi-threaded (at sampled "top" boundaries -- perhaps this should
>> use TLAB
>> boundaries instead), we would be able to scale OK to larger Edens.
>> Have you looked at the distribution of Eden scanning times during the
>> remark pause? Does Eden scanning dominate the remark cost? (I was
>> also wondering if it might be possible to avoid using whatever was
>> causing such frequent gc-locker activity as a temporary workaround
>> until the issue w/CMS is fixed?)
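>>
>> (For reference, by scanning at sampled "top" boundaries I mean roughly
>> the following scheme -- simplified pseudo-code, not the real code: the
>> CMS thread periodically records eden's top pointer, and at remark time
>> those samples split the used part of eden into chunks that the parallel
>> workers claim and scan independently.)
>>
>>   // _eden_samples[] holds addresses that were eden()->top() at sample
>>   // time, in increasing order, starting at eden()->bottom(); each
>>   // sample is assumed to fall on an object boundary, so each chunk can
>>   // be walked independently.
>>   void scan_eden_chunks(uint worker_id, uint num_workers) {
>>     for (uint i = worker_id; i < _num_samples; i += num_workers) {
>>       HeapWord* start = _eden_samples[i];
>>       HeapWord* end   = (i + 1 < _num_samples) ? _eden_samples[i + 1]
>>                                                : eden()->top();
>>       for (HeapWord* p = start; p < end; p += oop(p)->size()) {
>>         oop(p)->oop_iterate(&_remark_closure);  // mark refs into old gen
>>       }
>>     }
>>   }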
>>
>> -- ramki
>>
>>
>>
>> On Tue, Sep 2, 2014 at 3:21 PM, Tony Printezis
>> <tprintezis at twitter.com <mailto:tprintezis at twitter.com>> wrote:
>>
>> Hi there,
>>
>> In addition to the GCLocker issue I've already mentioned
>> (JDK-8048556: unnecessary young GCs due to the GCLocker) we're
>> also hitting a second one, which in some ways is more severe in
>> some cases.
>>
>> We use quite large edens and we run with
>> -XX:+CMSScavengeBeforeRemark to empty the eden before each remark
>> to keep remark times reasonable. It turns out that when the
>> remark pause is scheduled it doesn't try to synchronize with the
>> GCLocker at all. The result is that, quite often, the scavenge
>> before remark aborts because the GCLocker is active. This leads
>> to substantially longer remarks.
>>
>> A side-effect of this is that the remark pause with the aborted
>> scavenge is immediately followed by a GCLocker-initiated GC (with
>> the eden being half empty). The aborted scavenge checks whether
>> the GCLocker is active with check_active_before_gc() which tells
>> the GCLocker to do a young GC if it's active. And the young GC is
>> done without waiting for the eden to fill up.
>>
>> The issue is very easy to reproduce with a test similar to what I
>> posted on JDK-8048556 (force concurrent cycles by adding a thread
>> that calls System.gc() every say 10 secs and set
>> -XX:+ExplicitGCInvokesConcurrent). I can reproduce this with the
>> current hotspot-gc repo.
>>
>> We were wondering whether this is a known issue and whether
>> someone is working on it. FWIW, the fix could be a bit tricky.
>>
>> Thanks,
>>
>> Tony
>>
>> --
>> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>>
>> @TonyPrintezis
>> tprintezis at twitter.com <mailto:tprintezis at twitter.com>
>>
>>
>
> --
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>
> @TonyPrintezis
> tprintezis at twitter.com
--
Tony Printezis | JVM/GC Engineer / VM Team | Twitter
@TonyPrintezis
tprintezis at twitter.com