Very long (nearly 10s, abnormal) "Choose CSet" time in a young collection

chao chu chuchao333 at gmail.com
Thu May 9 18:04:49 UTC 2019


Hi Thomas,

Thanks for your reply. For completeness, the 'user' figure is 5.00 :)

Yes, weeks ago we identified that we were also affected by Linux THP, which
caused something like 90+ seconds of sys time in a full GC. But as you
mentioned, this one looks different: the sys time is very low.
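
(For reference, a quick way to check the current THP configuration on a
typical Linux host; the exact paths can vary by kernel/distro:

   cat /sys/kernel/mm/transparent_hugepage/enabled
   cat /sys/kernel/mm/transparent_hugepage/defrag
   grep AnonHugePages /proc/meminfo

The first two show whether THP is set to always/madvise/never, and the
AnonHugePages counter shows how much anonymous memory is currently backed
by huge pages.)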

There are two things we would like to understand here:

1. Where was the 5s of 'user' time spent?

2. Could the other ~5s (real - user) be because the host was heavily
loaded and the JVM got no CPU time during that period?
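
To put rough numbers on the second question (taking the figures from the
log below, and assuming the 'user' value really is 5.00): even in the
worst case where a single thread did all of the work, user + sys only
accounts for about 5.00 + 0.09 = 5.09s of the 10.55s of wall-clock time.
That leaves roughly 10.55 - 5.09 = 5.46s during which the JVM apparently
did not run on a CPU at all, which would be consistent with the process
being descheduled on a heavily loaded (or swapping) host.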

Regards,

On Thu, May 9, 2019 at 1:48 PM Thomas Schatzl <thomas.schatzl at oracle.com>
wrote:

> Hi,
>
> On Thu, 2019-05-09 at 12:43 -0400, chao chu wrote:
> > Hi,
> > We hit a weird issue in our prod environment: we observed a nearly
> > 10s "Choose CSet" time in a young GC (it was GCLocker-initiated,
> > though we're not sure if that is relevant), while the related "start
> > choosing cset..." and "finish choosing cset..." log lines happened
> > within the same millisecond, which just doesn't make sense.
> >
> > Looking at the related G1 code, everything relevant happens in this
> > `finalize_cset` function:
> > http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c16cc0cdcce5/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp#l1949
> >
> > the reported value for "Choose CSet" is simply the sum of these
> > two:
> >
> [...]
> >
> > From your experience, what could cause such a stall? Are there any
> > particular OS/host-level metrics we should check? (We only saw this
> > once and couldn't reproduce it.) Many thanks in advance; any insights
> > are welcome :)
> >
> >
> > The sample GC log sections are below (due to our firm's policy, I
> > couldn't post the GC log directly to the public, sorry about that):
> >
> >
> >    [Other:   9698.8 ms]
> >       [Choose CSet:   *9677.5 ms*]
> >       [Ref Proc:   0.3 ms]
> >       [Ref Enq:   0.0 ms]
> >       [Free CSet:   0.3 ms]
> >    [Eden: 18G(18G)->0B(352M) Survivors: 1232M->1280M Heap:
> > 26.5G(32G)->8175.2M(32G)]
> >  [Times: user=5 sys=0.09, real=10.55 secs]
>
>              ^^--- there seems to be something cut here for user time
> statistic
>
> Since you mentioned that this has been a one-off event, and given your
> heap size, it is unlikely that this phase would take that long on its
> own. It may instead be one of the known causes of such unexpected
> pauses:
>
> - swapping/memory issues on the machine, stalling the java process and
> basically everything else at that time.
>
> - the kernel trying to coalesce small pages into huge pages
> "transparently".
>
> - the VM trying to write out the log to disk during very high I/O
> activity of another process on the same machine, blocking for a long
> time.
>
> On the other hand, sys time seems very low in this case, so I may be
> wrong.
>
> I would start with something like vmstat (on Linux) to show i/o and
> memory activity.
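>
> For example, capturing "vmstat 1" around the time of the pause and
> watching the si/so (swap in/out), wa (CPU time waiting for I/O) and r
> (run queue length) columns should show whether the machine was
> swapping, stalled on I/O, or simply oversubscribed on CPU.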
>
> (Blatantly taken from
>
> https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-garbage-collector-tuning.html#GUID-8D9B2530-E370-4B8B-8ADD-A43674FC6658
>  :) )
>
> Thanks,
>   Thomas
>
>
>

-- 
ChuChao

