Unexplained long stop the world pauses during concurrent marking step in G1 Collector

Krishnamurthy, Kannan Kannan.Krishnamurthy at contractor.cengage.com
Thu Aug 28 19:38:27 UTC 2014


We managed to get couple of native stack traces when the pause was happening. 

>From the thread dump, we don't see any application thread blocking the JVM. We are wondering whether the JVM Safepoint and a single concurrent marking thread are blocking each other.

Links to the complete stack traces (~5.5MB each):
First : https://drive.google.com/file/d/0B95RUv-vjsAfcE95MlVmMVhKZTQ/edit?usp=sharing
Second : https://drive.google.com/file/d/0B95RUv-vjsAfNklxMnZVaEI2UTg/edit?usp=sharing

Most of the threads are in some kinds of waiting (pthread_cond_wait@@GLIBC_2.3.2, epoll_wait, or pthread_cond_timedwait). 

We see that only one thread was doing any work across two successive stack traces 2 seconds aparts.  For that thread 1820, in the first stack trace, we see:

Thread 1820 (Thread 0x7f266f2f5700 (LWP 17402)):
#0  0x00007f26bc55f604 in Monitor::ILock(Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#1  0x00007f26bc55f9bf in Monitor::lock_without_safepoint_check() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc229779 in CMTask::move_entries_to_global_stack() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#3  0x00007f26bc230974 in CMTask::push(oopDesc*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#4  0x00007f26bc230b39 in CMTask::deal_with_reference(oopDesc*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#5  0x00007f26bc582ef7 in objArrayKlass::oop_oop_iterate_nv(oopDesc*, G1CMOopClosure*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#6  0x00007f26bc22abc5 in CMTask::drain_local_queue(bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#7  0x00007f26bc233562 in CMBitMapClosure::do_bit(unsigned long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#8  0x00007f26bc22b955 in CMTask::do_marking_step(double, bool, bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#9  0x00007f26bc232d8b in CMConcurrentMarkingTask::work(unsigned int) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#10 0x00007f26bc70eaff in GangWorker::loop() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#11 0x00007f26bc599ff0 in java_start(Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#12 0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
----------------------

For the same thread 1820 in the second trace, we see:

Thread 1820 (Thread 0x7f266f2f5700 (LWP 17402)):
#0  0x00007f26bc208080 in oopDesc::size() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#1  0x00007f26bc230ace in CMTask::deal_with_reference(oopDesc*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc582ef7 in objArrayKlass::oop_oop_iterate_nv(oopDesc*, G1CMOopClosure*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#3  0x00007f26bc22abc5 in CMTask::drain_local_queue(bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#4  0x00007f26bc233562 in CMBitMapClosure::do_bit(unsigned long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#5  0x00007f26bc22b955 in CMTask::do_marking_step(double, bool, bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#6  0x00007f26bc232d8b in CMConcurrentMarkingTask::work(unsigned int) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#7  0x00007f26bc70eaff in GangWorker::loop() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#8  0x00007f26bc599ff0 in java_start(Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#9  0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()


***************************************************************************************

The other thread interesting thread VMThread is trying to start a Safepoint Synchronize.

Thread 1816 (Thread 0x7f266ec26700 (LWP 17406)):
#0  0x00007f26bd31b61c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f26bc594103 in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc55ff0f in Monitor::IWait(Thread*, long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#3  0x00007f26bc56069e in Monitor::wait(bool, long, bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#4  0x00007f26bc2249b5 in ConcurrentGCThread::safepoint_synchronize() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#5  0x00007f26bc61d4fb in SafepointSynchronize::begin() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#6  0x00007f26bc700227 in VMThread::loop() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#7  0x00007f26bc7008d0 in VMThread::run() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#8  0x00007f26bc599ff0 in java_start(Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#9  0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

**********************************************
The complete break down of rest of threads as below:

(1632) Threads with stack 
Thread 1818 (Thread 0x7f266f0f3700 (LWP 17404)):
#0  0x00007f26bd31b61c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f26bc594103 in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc55ff0f in Monitor::IWait(Thread*, long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#3  0x00007f26bc56069e in Monitor::wait(bool, long, bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so


(156) Threads with stack 
Thread 10 (Thread 0x7f1cf6e31700 (LWP 973)):
#0  0x00007f26bd31b61c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f26bc598427 in Parker::park(bool, long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc6d84ad in Unsafe_Park () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so



(24) Threads with stack 
#0  0x00007f26bcc522e3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f2669eb97ba in Java_sun_nio_ch_EPollArrayWrapper_epollWait () from /usr/java/jdk1.7.0_04/jre/lib/amd64/libnio.so


(18) Threads with stack  
#0  0x00007f26bd31b989 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f26bc598c27 in os::PlatformEvent::park(long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc589880 in ObjectMonitor::wait(long, bool, Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#3  0x00007f26bc3febd1 in JVM_MonitorWait () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so

(36) Threads with stack  
#0  0x00007f26bd31b989 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f26bc59847f in Parker::park(bool, long) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc6d84ad in Unsafe_Park () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so

(1) Thread with stack 

#0  0x00007f26bd31da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007f26bc59881a in check_pending_signals(bool) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#2  0x00007f26bc592945 in signal_thread_entry(JavaThread*, Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#3  0x00007f26bc6b8228 in JavaThread::thread_main_inner() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#4  0x00007f26bc6b8378 in JavaThread::run() () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#5  0x00007f26bc599ff0 in java_start(Thread*) () from /usr/java/jdk1.7.0_04/jre/lib/amd64/server/libjvm.so
#6  0x00007f26bd3177b6 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f26bcc51c5d in clone () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()


Thanks,
Kannan

________________________________________
From: Krishnamurthy, Kannan
Sent: Tuesday, August 26, 2014 2:09 PM
To: Martin Makundi; Yu Zhang
Cc: hotspot-gc-use at openjdk.java.net; kndkannan at gmail.com
Subject: RE: Unexplained long stop the world pauses during concurrent marking step in G1 Collector

Thanks for the responses. I agree with Jenny that the gc logs doesn't seem to indicate that the pause was from a GC event.

Profiling the application and trying to get some native stack traces/ core dumps.
________________________________
From: Martin Makundi [martin.makundi at koodaripalvelut.com]
Sent: Monday, August 25, 2014 10:07 PM
To: Yu Zhang
Cc: Krishnamurthy, Kannan; hotspot-gc-use at openjdk.java.net; kndkannan at gmail.com
Subject: Re: Unexplained long stop the world pauses during concurrent marking step in G1 Collector

Try running command kill -3 <pid> to see what's going on with the application threads.

**
Martin


2014-08-26 4:48 GMT+03:00 Yu Zhang <yu.zhang at oracle.com<mailto:yu.zhang at oracle.com>>:
Kannan,

The concurrent marking is concurrent, meaning it runs concurrently with the application.
You may see the time between [GC concurrent-mark-start] and
[GC concurrent-mark-stop] very long, maybe even some young gc happened during this time period.  This is because the marking threads can be suspended.

>*2014-08-07T13:42:39.598-0400: 92192.348: Application time: 9.0448670 seconds**
    2014-08-07T13:42:39.601-0400: 92192.351: Total time for which application threads were stopped: 0.0029740 seconds
This means the application has run for 9.0448670 seconds.  And only being stopped for
0.0029740 seconds.

>From this log, gc did not stop the application.

>Linux "top" shows single CPU at 100% and rest of the CPUs at 0% during the pause.
Maybe something in the application or OS is running on 1 cpu and blocking other threads.

Thanks,
Jenny

On 8/18/2014 1:34 PM, Krishnamurthy, Kannan wrote:
Greetings,

We are experiencing unexplained/unknown long pauses (8 seconds) during concurrent marking step of G1 collector.

    2014-08-07T13:42:30.552-0400: 92183.303: [GC concurrent-root-region-scan-start]
    2014-08-07T13:42:30.555-0400: 92183.305: [GC concurrent-root-region-scan-end, 0.0025230 secs]
    **2014-08-07T13:42:30.555-0400: 92183.305: [GC concurrent-mark-start]**
    **2014-08-07T13:42:39.598-0400: 92192.348: Application time: 9.0448670 seconds**
    2014-08-07T13:42:39.601-0400: 92192.351: Total time for which application threads were stopped: 0.0029740 seconds
    2014-08-07T13:42:39.603-0400: 92192.353: [GC pause (G1 Evacuation Pause) (young) 92192.354: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 7980, predicted base time: 28.19 ms, remaining time: 71.81 ms, target pause time: 100.00 ms


`2014-08-07T13:42:30.555-0400: 92183.305` is when the concurrent mark starts, approximately after 2 seconds of this step the application starts to pause. However the GC logs claims the application was not paused during this window.
Linux "top" shows single CPU at 100% and rest of the CPUs at 0% during the pause.

Any help in understanding the root cause of this issue is appreciated.

Our target JVMS:

    java version "1.7.0_04"
    Java(TM) SE Runtime Environment (build 1.7.0_04-b20)
    Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)

    java version "1.8.0_11"
    Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
    Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)


Our JVM options :

    -Xms20G -Xmx20G -Xss10M -XX:PermSize=128M -XX:MaxPermSize=128M -XX:MarkStackSize=16M
    -XX:+UnlockDiagnosticVMOptions -XX:+G1PrintRegionLivenessInfo -XX:+TraceGCTaskThread
    -XX:+G1SummarizeConcMark -XX:+G1SummarizeRSetStats -XX:+G1TraceConcRefinement
    -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:InitiatingHeapOccupancyPercent=65
    -XX:ParallelGCThreads=24 -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
    -Xloggc:/common/logs/ocean-partition-gc.log


Thanks and regards,
Kannan




_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net<mailto:hotspot-gc-use at openjdk.java.net>
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use



_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net<mailto:hotspot-gc-use at openjdk.java.net>
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use




More information about the hotspot-gc-use mailing list