Unexplained long stop the world pauses during concurrent marking step in G1 Collector

Tue Sep 2 06:11:45 UTC 2014

Hi Thomas --

Thanks for the test. I think you are right that CMS doesn't yield after a
partial scan of an array either. The dirty card rescan is for the
incremental update scanning following the initial scan. So, yes, as you
state, CMS and G1 are probably equally susceptible to this issue.
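
To make the susceptibility concrete, here is a toy Java model (not HotSpot
code; the class and method names are made up for illustration) of a marking
step that checks for a yield request only between objects. It counts how
many reference slots still get visited after the request is raised:

```java
public class CoarseYieldSketch {
    /**
     * Toy model: each Object[] stands for one heap object with reference slots.
     * A yield request is assumed to arrive once `requestAtSlot` slots have been
     * visited. Returns how many further slots are visited before the step
     * notices the request (or runs out of objects).
     */
    static long slotsScannedAfterYieldRequest(Object[][] objects, long requestAtSlot) {
        long visited = 0;
        long overshoot = 0;
        for (Object[] obj : objects) {
            // The only yield check: between objects, never inside one.
            if (visited >= requestAtSlot) {
                return overshoot;
            }
            for (int i = 0; i < obj.length; i++) {
                visited++;
                if (visited > requestAtSlot) {
                    overshoot++; // work done while the safepoint is waiting
                }
            }
        }
        return overshoot;
    }

    public static void main(String[] args) {
        Object[][] oneHuge = { new Object[1_000_000] };
        Object[][] manySmall = new Object[1000][1000];
        System.out.println(slotsScannedAfterYieldRequest(oneHuge, 10));
        System.out.println(slotsScannedAfterYieldRequest(manySmall, 10));
    }
}
```

With the same total number of slots, the delay in this model is bounded by
the size of the largest single object, which is why one gigantic reference
array hurts far more than many small ones.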

Thanks for filing the bug. I am guessing we could have a way by which
marking state could be remembered in the form of resumption point(s) for
partially scanned object arrays. One technique I vaguely recall in CMS was
to re-dirty the unscanned part of an object array so that the re-dirtied
suffix would be picked up and scanned at a later time. But perhaps it
applied to a different part of the scanning... my memory is a bit foggy,
and I'm just trying to get set up with the code following a longish break.
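
Roughly, the resumption-point idea could be sketched like this in Java. This
is only an illustration under my own assumptions, not how HotSpot does or
would do it: ArrayTask, CHUNK and the slot budget are hypothetical names and
values, and the mark stack is a plain Deque.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ChunkedArrayScanSketch {
    /** Hypothetical resumption point: an array plus its first unscanned index. */
    static final class ArrayTask {
        final Object[] array;
        final int start;
        ArrayTask(Object[] array, int start) {
            this.array = array;
            this.start = start;
        }
    }

    static final int CHUNK = 1024; // hypothetical chunk size, in reference slots

    /** Scans at most CHUNK slots, then pushes a continuation for the unscanned suffix. */
    static int scanChunk(ArrayTask task, Deque<ArrayTask> markStack) {
        int end = (int) Math.min((long) task.start + CHUNK, task.array.length);
        for (int i = task.start; i < end; i++) {
            Object ref = task.array[i];
            if (ref != null) {
                // A real marker would mark `ref` and push it for later scanning.
            }
        }
        if (end < task.array.length) {
            markStack.push(new ArrayTask(task.array, end)); // remember where to resume
        }
        return end - task.start;
    }

    /** One marking step: stops once the slot budget is used up, so it can yield. */
    static long markingStep(Deque<ArrayTask> markStack, long slotBudget) {
        long scanned = 0;
        while (!markStack.isEmpty() && scanned < slotBudget) {
            scanned += scanChunk(markStack.pop(), markStack);
        }
        return scanned;
    }

    public static void main(String[] args) {
        Deque<ArrayTask> markStack = new ArrayDeque<>();
        markStack.push(new ArrayTask(new Object[5000], 0));
        System.out.println(markingStep(markStack, 2000));           // first step, then yield
        System.out.println(markingStep(markStack, Long.MAX_VALUE)); // resume and finish
    }
}
```

The step overshoots its budget by at most one chunk, and the array costs one
mark stack entry at a time rather than one per element.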

-- ramki

On Mon, Sep 1, 2014 at 8:08 AM, Thomas Schatzl <thomas.schatzl at oracle.com>
wrote:

>
> Hi all,
>
>   having had some time to investigate this issue, I can confirm the
> problem. Large reference arrays cause very long pauses.
>
> I filed https://bugs.openjdk.java.net/browse/JDK-8057003 for this problem.
>
> In addition to that, very large object arrays will trip other pathological
> performance problems.
>
> For example, a mark stack overflow is almost guaranteed, which prevents
> marking from completing and leads to full GCs.
>
> On Fri, 2014-08-29 at 16:07 +0000, Krishnamurthy, Kannan wrote:
> > Ramki,
> >
> > Thanks for the detailed explanation. Will continue to profile
> > further and share the findings. Excuse my naivety, but does the default
> > value of 10 ms for G1ConcMarkStepDurationMillis still not help
> > in this case?
> > Will G1RefProcDrainInterval be of any use?
>
> No. The only workaround I can see is to make sure that there are no
> such large objects at all at this time.
>
> > ________________________________________
> > From: Srinivas Ramakrishna [ysr1729 at gmail.com]
> > Sent: Thursday, August 28, 2014 7:25 PM
> > To: Krishnamurthy, Kannan
> > Cc: Martin Makundi; Yu Zhang; hotspot-gc-use at openjdk.java.net;
> >kndkannan at gmail.com; Zhou, Jerry
> > Subject: Re: Unexplained long stop the world pauses during
> >concurrent marking step in G1 Collector
> >
> > It's been a while since I looked at G1 code and I'm sure it's
> > evolved a bunch since then...
> >
> > Hi Kannan --
> >
> > As you surmised, it's likely that the marking step isn't checking
> > at a sufficiently fine granularity whether a safepoint has been
> > requested. Or, equivalently, the marking step is doing too much
> > work in one "step", thus preventing a safepoint while the marking
> > step is in progress. If you have GC logs from the application, you
> > could look at the allocation rates that you observe and compare
> > the rates during the marking phase and outside of the marking
> > phase. I am guessing that because
> > of this, the marking phase must be slowing down allocation, and we
> > can get a measure of that from your GC logs. It is clear from your
> > stack traces that
> > the mutators are all blocked for allocation, while a safepoint is
> > waiting for the marking step to yield.
> >
> > It could be (from the stack trace) that we are scanning from a
> > gigantic obj array and perhaps the marking step can yield only after
> > the entire array has been scanned. In which case, the use of large
> > object arrays (or hash tables) could be a performance anti-pattern
> > for G1.
> > Perhaps we should allow for partial scanning of arrays -- I can't
> > recall if CMS does that for marking -- save the
> > state of the partial scan and resume from that point after the
> > yield (which occurs at a sufficiently fine granularity).
>
> CMS does not split large objects or yield on parts of large objects
> either, as far as I can see. Some parts of the scanning only scan
> dirty cards within these objects, but I am not sure that this is
> sufficient in all cases. I think it is unrelated.
>
> Maybe there is somebody with more experience on the CMS code that can
> verify this.
>
> > This used to be an issue with CMS as well in the early days and we
> > had to refine the granularity of the marking steps
> > (or the so-called "concurrent work yield points" -- points at which
> > the marking will stop to allow a scavenge to proceed). I am
> > guessing we'll need to refine the granularity at which G1 does
> > these yields to allow a young collection to proceed in a timely
> > fashion.
>
> Thanks,
>   Thomas
>
>
>