Unexplained long stop the world pauses during concurrent marking step in G1 Collector
Srinivas Ramakrishna
ysr1729 at gmail.com
Thu Aug 28 23:25:34 UTC 2014
It's been a while since I looked at G1 code and I'm sure it's evolved a
bunch sine then...
Hi Kannan --
As you surmised, it's likely that the marking step isn't checking at a
sufficiently fine granularity whether a safepoint has been requested. Or,
equivalently, the marking step is doing too much work in one "step", thus
preventing a safepoint while the marking step is in progress. If you have
GC logs from the application, you
could look at the allocation rates that you observe and compare the rates
during the marking phase and outside of the marking phase. I am guessing
that because
of this, the making phase must be slowing down allocation, and we can get a
measure of that from your GC logs. It is clear from your stack traces that
the mutators are all blocked for allocation, while a safepoint is waiting
for the marking step to yield.
It could be (from the stack retrace) that we are scanning from a gigantic
obj array and perhaps the marking step can yield only after
the entire array has been scanned. In which case, the use of large object
arrays (or hash tables) could be a performance anti-pattern for G1.
Perhaps we should allow for partial scanning of arrays -- i can't recall if
CMS does that for marking -- save the
state of the partial scan and resume from that point after the yield (which
occurs at a sufficiently fine granularity).
This used to be an issue with CMS as well in the early days and we had to
refine the granularity of the marking steps
(or the so-called "concurrent work yield points" -- points at which the
marking will stop to allow a scavenge to proceed). I am
guessing we'll need to refine the granularity at which G1 does these yields
to allow a young collection to proceed in a timely fashion.
-- ramki
On Thu, Aug 28, 2014 at 12:38 PM, Krishnamurthy, Kannan <
Kannan.Krishnamurthy at contractor.cengage.com> wrote:
> We managed to get couple of native stack traces when the pause was
> happening.
>
> From the thread dump, we don't see any application thread blocking the
> JVM. We are wondering whether the JVM Safepoint and a single concurrent
> marking thread are blocking each other.
>
> Links to the complete stack traces (~5.5MB each):
> First :
> https://drive.google.com/file/d/0B95RUv-vjsAfcE95MlVmMVhKZTQ/edit?usp=sharing
> Second :
> https://drive.google.com/file/d/0B95RUv-vjsAfNklxMnZVaEI2UTg/edit?usp=sharing
>
> Most of the threads are in some kinds of waiting (pthread_cond_wait@@GLIBC_2.3.2,
> epoll_wait, or pthread_cond_timedwait).
>
> We see that only one thread was doing any work across two successive stack
> traces 2 seconds aparts. For that thread 1820, in the first stack trace,
> we see:
>
> Thread 1820 (Thread 0x7f266f2f5700 (LWP 17402)):
> #0 0x00007f26bc55f604 in Monitor::ILock(Thread*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #1 0x00007f26bc55f9bf in Monitor::lock_without_safepoint_check() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc229779 in CMTask::move_entries_to_global_stack() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #3 0x00007f26bc230974 in CMTask::push(oopDesc*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #4 0x00007f26bc230b39 in CMTask::deal_with_reference(oopDesc*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #5 0x00007f26bc582ef7 in objArrayKlass::oop_oop_iterate_nv(oopDesc*,
> G1CMOopClosure*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #6 0x00007f26bc22abc5 in CMTask::drain_local_queue(bool) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #7 0x00007f26bc233562 in CMBitMapClosure::do_bit(unsigned long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #8 0x00007f26bc22b955 in CMTask::do_marking_step(double, bool, bool) ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #9 0x00007f26bc232d8b in CMConcurrentMarkingTask::work(unsigned int) ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #10 0x00007f26bc70eaff in GangWorker::loop() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #11 0x00007f26bc599ff0 in java_start(Thread*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #12 0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
> #14 0x0000000000000000 in ?? ()
> ----------------------
>
> For the same thread 1820 in the second trace, we see:
>
> Thread 1820 (Thread 0x7f266f2f5700 (LWP 17402)):
> #0 0x00007f26bc208080 in oopDesc::size() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #1 0x00007f26bc230ace in CMTask::deal_with_reference(oopDesc*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc582ef7 in objArrayKlass::oop_oop_iterate_nv(oopDesc*,
> G1CMOopClosure*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #3 0x00007f26bc22abc5 in CMTask::drain_local_queue(bool) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #4 0x00007f26bc233562 in CMBitMapClosure::do_bit(unsigned long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #5 0x00007f26bc22b955 in CMTask::do_marking_step(double, bool, bool) ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #6 0x00007f26bc232d8b in CMConcurrentMarkingTask::work(unsigned int) ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #7 0x00007f26bc70eaff in GangWorker::loop() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #8 0x00007f26bc599ff0 in java_start(Thread*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #9 0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
> #11 0x0000000000000000 in ?? ()
>
>
>
> ***************************************************************************************
>
> The other thread interesting thread VMThread is trying to start a
> Safepoint Synchronize.
>
> Thread 1816 (Thread 0x7f266ec26700 (LWP 17406)):
> #0 0x00007f26bd31b61c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f26bc594103 in os::PlatformEvent::park() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc55ff0f in Monitor::IWait(Thread*, long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #3 0x00007f26bc56069e in Monitor::wait(bool, long, bool) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #4 0x00007f26bc2249b5 in ConcurrentGCThread::safepoint_synchronize() ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #5 0x00007f26bc61d4fb in SafepointSynchronize::begin() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #6 0x00007f26bc700227 in VMThread::loop() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #7 0x00007f26bc7008d0 in VMThread::run() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #8 0x00007f26bc599ff0 in java_start(Thread*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #9 0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
> #11 0x0000000000000000 in ?? ()
>
> **********************************************
> The complete break down of rest of threads as below:
>
> (1632) Threads with stack
> Thread 1818 (Thread 0x7f266f0f3700 (LWP 17404)):
> #0 0x00007f26bd31b61c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f26bc594103 in os::PlatformEvent::park() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc55ff0f in Monitor::IWait(Thread*, long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #3 0x00007f26bc56069e in Monitor::wait(bool, long, bool) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
>
>
> (156) Threads with stack
> Thread 10 (Thread 0x7f1cf6e31700 (LWP 973)):
> #0 0x00007f26bd31b61c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f26bc598427 in Parker::park(bool, long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc6d84ad in Unsafe_Park () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
>
>
>
> (24) Threads with stack
> #0 0x00007f26bcc522e3 in epoll_wait () from /lib64/libc.so.6
> #1 0x00007f2669eb97ba in Java_sun_nio_ch_EPollArrayWrapper_epollWait ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/libnio.so
>
>
> (18) Threads with stack
> #0 0x00007f26bd31b989 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f26bc598c27 in os::PlatformEvent::park(long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc589880 in ObjectMonitor::wait(long, bool, Thread*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #3 0x00007f26bc3febd1 in JVM_MonitorWait () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
>
> (36) Threads with stack
> #0 0x00007f26bd31b989 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1 0x00007f26bc59847f in Parker::park(bool, long) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc6d84ad in Unsafe_Park () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
>
> (1) Thread with stack
>
> #0 0x00007f26bd31da00 in sem_wait () from /lib64/libpthread.so.0
> #1 0x00007f26bc59881a in check_pending_signals(bool) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #2 0x00007f26bc592945 in signal_thread_entry(JavaThread*, Thread*) ()
> from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #3 0x00007f26bc6b8228 in JavaThread::thread_main_inner() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #4 0x00007f26bc6b8378 in JavaThread::run() () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #5 0x00007f26bc599ff0 in java_start(Thread*) () from
> /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
> #6 0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
> #7 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
> #8 0x0000000000000000 in ?? ()
>
>
> Thanks,
> Kannan
>
> ________________________________________
> From: Krishnamurthy, Kannan
> Sent: Tuesday, August 26, 2014 2:09 PM
> To: Martin Makundi; Yu Zhang
> Cc: hotspot-gc-use at openjdk.java.net; kndkannan at gmail.com
> Subject: RE: Unexplained long stop the world pauses during concurrent
> marking step in G1 Collector
>
> Thanks for the responses. I agree with Jenny that the gc logs doesn't seem
> to indicate that the pause was from a GC event.
>
> Profiling the application and trying to get some native stack traces/ core
> dumps.
> ________________________________
> From: Martin Makundi [martin.makundi at koodaripalvelut.com]
> Sent: Monday, August 25, 2014 10:07 PM
> To: Yu Zhang
> Cc: Krishnamurthy, Kannan; hotspot-gc-use at openjdk.java.net;
> kndkannan at gmail.com
> Subject: Re: Unexplained long stop the world pauses during concurrent
> marking step in G1 Collector
>
> Try running command kill -3 <pid> to see what's going on with the
> application threads.
>
> **
> Martin
>
>
> 2014-08-26 4:48 GMT+03:00 Yu Zhang <yu.zhang at oracle.com<mailto:
> yu.zhang at oracle.com>>:
> Kannan,
>
> The concurrent marking is concurrent, meaning it runs concurrently with
> the application.
> You may see the time between [GC concurrent-mark-start] and
> [GC concurrent-mark-stop] very long, maybe even some young gc happened
> during this time period. This is because the marking threads can be
> suspended.
>
> >*2014-08-07T13:42:39.598-0400: 92192.348: Application time: 9.0448670
> seconds**
> 2014-08-07T13:42:39.601-0400: 92192.351: Total time for which
> application threads were stopped: 0.0029740 seconds
> This means the application has run for 9.0448670 seconds. And only being
> stopped for
> 0.0029740 seconds.
>
> From this log, gc did not stop the application.
>
> >Linux "top" shows single CPU at 100% and rest of the CPUs at 0% during
> the pause.
> Maybe something in the application or OS is running on 1 cpu and blocking
> other threads.
>
> Thanks,
> Jenny
>
> On 8/18/2014 1:34 PM, Krishnamurthy, Kannan wrote:
> Greetings,
>
> We are experiencing unexplained/unknown long pauses (8 seconds) during
> concurrent marking step of G1 collector.
>
> 2014-08-07T13:42:30.552-0400: 92183.303: [GC
> concurrent-root-region-scan-start]
> 2014-08-07T13:42:30.555-0400: 92183.305: [GC
> concurrent-root-region-scan-end, 0.0025230 secs]
> **2014-08-07T13:42:30.555-0400: 92183.305: [GC concurrent-mark-start]**
> **2014-08-07T13:42:39.598-0400: 92192.348: Application time: 9.0448670
> seconds**
> 2014-08-07T13:42:39.601-0400: 92192.351: Total time for which
> application threads were stopped: 0.0029740 seconds
> 2014-08-07T13:42:39.603-0400: 92192.353: [GC pause (G1 Evacuation
> Pause) (young) 92192.354: [G1Ergonomics (CSet Construction) start choosing
> CSet, _pending_cards: 7980, predicted base time: 28.19 ms, remaining time:
> 71.81 ms, target pause time: 100.00 ms
>
>
> `2014-08-07T13:42:30.555-0400: 92183.305` is when the concurrent mark
> starts, approximately after 2 seconds of this step the application starts
> to pause. However the GC logs claims the application was not paused during
> this window.
> Linux "top" shows single CPU at 100% and rest of the CPUs at 0% during the
> pause.
>
> Any help in understanding the root cause of this issue is appreciated.
>
> Our target JVMS:
>
> java version "1.7.0_04"
> Java(TM) SE Runtime Environment (build 1.7.0_04-b20)
> Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)
>
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>
>
> Our JVM options :
>
> -Xms20G -Xmx20G -Xss10M -XX:PermSize=128M -XX:MaxPermSize=128M
> -XX:MarkStackSize=16M
> -XX:+UnlockDiagnosticVMOptions -XX:+G1PrintRegionLivenessInfo
> -XX:+TraceGCTaskThread
> -XX:+G1SummarizeConcMark -XX:+G1SummarizeRSetStats
> -XX:+G1TraceConcRefinement
> -XX:+UseG1GC -XX:MaxGCPauseMillis=100
> -XX:InitiatingHeapOccupancyPercent=65
> -XX:ParallelGCThreads=24 -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy
> -XX:+PrintTenuringDistribution
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime
> -Xloggc:/common/logs/ocean-partition-gc.log
>
>
> Thanks and regards,
> Kannan
>
>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net<mailto:hotspot-gc-use at openjdk.java.net>
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net<mailto:hotspot-gc-use at openjdk.java.net>
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20140828/ff203325/attachment-0001.html>
More information about the hotspot-gc-use
mailing list