Unexplained long stop the world pauses during concurrent marking step in G1 Collector
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Sep 1 15:08:37 UTC 2014
Hi all,
having had some time to investigate this issue, I can confirm the
problem. Large reference arrays cause very long pauses.
I filed https://bugs.openjdk.java.net/browse/JDK-8057003 for this problem.
In addition to that, very large object arrays trip other pathological
performance problems, e.g. an almost guaranteed mark stack overflow
that prevents completion of the marking, leading to full GCs.
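As a stopgap for the overflow part, the global marking stack can be sized up so it overflows later or not at all. The flag names below are the HotSpot flags of that era; the 64M/512M values are arbitrary examples for illustration, not tuned recommendations, and a larger stack may only delay the overflow rather than prevent it:

```shell
# Enlarge G1's global marking stack (initial and maximum size) so that
# concurrent marking is less likely to overflow on huge reference arrays.
# Sizes are illustrative only; app.jar stands in for your application.
java -XX:+UseG1GC -XX:MarkStackSize=64M -XX:MarkStackSizeMax=512M -jar app.jar
```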
On Fri, 2014-08-29 at 16:07 +0000, Krishnamurthy, Kannan wrote:
> Ramki,
>
> Thanks for the detailed explanation. Will continue to profile
>further and share the findings. Excuse my naivety, but doesn't the
>default value of 10 ms for G1ConcMarkStepDurationMillis still help
>in this case?
> Will G1RefProcDrainInterval be of any use ?
No. The only workaround I can see at this time is to avoid such
large objects entirely.
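One way to apply that workaround at the application level is to avoid allocating a single huge Object[], e.g. by replacing one flat array with a two-level array of fixed-size chunks, so no individual reference array is ever large enough to stall the marking step. This is an illustrative sketch, not JDK code; the class name and chunk size are invented for the example:

```java
/**
 * Sketch of a two-level "chunked" array that avoids one huge Object[]:
 * each inner chunk is small, so G1 never scans a gigantic reference
 * array in a single marking step. Chunk size is an arbitrary example.
 */
final class ChunkedArray<T> {
    private static final int CHUNK_BITS = 10;           // 1024 elements per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final Object[][] chunks;
    private final int length;

    ChunkedArray(int length) {
        this.length = length;
        int nChunks = (length + CHUNK_SIZE - 1) >>> CHUNK_BITS;
        chunks = new Object[nChunks][];
        for (int i = 0; i < nChunks; i++) {
            // Last chunk may be shorter than CHUNK_SIZE.
            int remaining = length - (i << CHUNK_BITS);
            chunks[i] = new Object[Math.min(CHUNK_SIZE, remaining)];
        }
    }

    @SuppressWarnings("unchecked")
    T get(int i) {
        checkIndex(i);
        return (T) chunks[i >>> CHUNK_BITS][i & CHUNK_MASK];
    }

    void set(int i, T value) {
        checkIndex(i);
        chunks[i >>> CHUNK_BITS][i & CHUNK_MASK] = value;
    }

    int length() { return length; }

    private void checkIndex(int i) {
        if (i < 0 || i >= length)
            throw new IndexOutOfBoundsException("index " + i + ", length " + length);
    }
}
```

The indirection costs one extra load per access, but it bounds the size of every reference array the collector has to scan in one go.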
> ________________________________________
> From: Srinivas Ramakrishna [ysr1729 at gmail.com]
> Sent: Thursday, August 28, 2014 7:25 PM
> To: Krishnamurthy, Kannan
> Cc: Martin Makundi; Yu Zhang; hotspot-gc-use at openjdk.java.net;
>kndkannan at gmail.com; Zhou, Jerry
> Subject: Re: Unexplained long stop the world pauses during
>concurrent marking step in G1 Collector
>
> It's been a while since I looked at G1 code and I'm sure it's
>evolved a bunch since then...
>
> Hi Kannan --
>
> As you surmised, it's likely that the marking step isn't checking
> at a sufficiently fine granularity whether a safepoint has been
> requested. Or, equivalently, the marking step is doing too much
> work in one "step", thus preventing a safepoint while the marking
> step is in progress. If you have GC logs from the application, you
> could look at the allocation rates that you observe and compare
> the rates during the marking phase and outside of the marking
> phase. I am guessing that because
> of this, the marking phase must be slowing down allocation, and we
> can get a measure of that from your GC logs. It is clear from your
> stack traces that
> the mutators are all blocked for allocation, while a safepoint is
> waiting for the marking step to yield.
>
> It could be (from the stack trace) that we are scanning a
> gigantic object array and perhaps the marking step can yield only after
> the entire array has been scanned. In which case, the use of large
> object arrays (or hash tables) could be a performance anti-pattern
> for G1.
> Perhaps we should allow for partial scanning of arrays -- I can't
> recall if CMS does that for marking -- save the
> state of the partial scan and resume from that point after the
> yield (which occurs at a sufficiently fine granularity).
CMS does not split large objects or yield on parts of large objects
either, as far as I can see. Some parts of the scanning only scan
dirty cards within these objects, but I am not sure that this is
sufficient in all cases. I think it is unrelated.
Maybe there is somebody with more experience on the CMS code that can
verify this.
> This used to be an issue with CMS as well in the early days and we
> had to refine the granularity of the marking steps
> (or the so-called "concurrent work yield points" -- points at which
> the marking will stop to allow a scavenge to proceed). I am
> guessing we'll need to refine the granularity at which G1 does
> these yields to allow a young collection to proceed in a timely
> fashion.
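The chunked-yield scheme Ramki describes can be sketched in plain Java: scan in fixed-size strides, check for a yield request between strides, and save the partial-scan position so work resumes where it left off. None of this is actual G1/CMS code; the class, stride size, and method names are invented for the illustration:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Illustration only: scan a large array in fixed-size strides, checking
 * between strides whether a "safepoint" (yield) has been requested.
 * Partial-scan state is saved so the scan resumes after the yield.
 */
final class ChunkedScanner {
    private static final int STRIDE = 4096;  // elements per uninterruptible step

    private final AtomicBoolean yieldRequested = new AtomicBoolean();
    private int resumeIndex;                 // saved partial-scan state

    void requestYield() { yieldRequested.set(true); }

    /**
     * Scans from the saved position. Returns true when the whole array has
     * been visited, false if it stopped early to honor a yield request.
     */
    boolean scan(Object[] array) {
        int i = resumeIndex;
        while (i < array.length) {
            int end = Math.min(i + STRIDE, array.length);
            for (; i < end; i++) {
                // visit(array[i]) -- mark the reference; omitted here
            }
            // Yield point between strides: bounded latency regardless of
            // total array size.
            if (yieldRequested.getAndSet(false) && i < array.length) {
                resumeIndex = i;             // remember where to resume
                return false;
            }
        }
        resumeIndex = 0;
        return true;
    }
}
```

The key property is that the time to reach a yield point is bounded by the stride, not by the size of the object being scanned.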
Thanks,
Thomas