From ecki at zusammenkunft.net  Mon Jun  1 11:37:42 2015
From: ecki at zusammenkunft.net (Bernd)
Date: Mon, 1 Jun 2015 13:37:42 +0200
Subject: JVM taking a few seconds to reach a safepoint for routine young gen GC
In-Reply-To: References: Message-ID:

Yes they can continue, but they also have to. If not all threads have
reached the safepoint, the action (GC) that requested the safepoint cannot
proceed and start its work - as it is not safe. So basically the slowest
thread determines how long it takes to reach the safepoint.

Bernd

On 30.05.2015 19:35, "Gustav Åkesson" wrote:

> Hi,
>
> I thought that threads (blocked) in native were not a problem for
> safe-pointing? Meaning that these threads could continue to execute/block
> during an attempt to reach a safepoint, but are denied return to Java-land
> until safe-pointing is over.
>
> Best Regards,
> Gustav Åkesson
> On 30 May 2015 04:14, "Srinivas Ramakrishna" wrote:
>
>> Hi Jason --
>>
>> You mentioned a Lucene indexer on the same box. Can you check for
>> correlation between the indexing activity, paging behavior and the
>> incidence of the long safepoints?
>>
>> -- ramki
>>
>> ysr1729
>>
>> On May 29, 2015, at 09:51, Jason Goetz wrote:
>>
>> Oops, I did not intend to remove this from the list. Re-added.
>>
>> I'll take a look at how many RUNNABLE threads are actually blocked in
>> native code. I'll also look at vSphere to see if I can see anything
>> unusual around resource contention.
>>
>> I've grepped the safepoint logs for GenCollectForAllocation, which, as I
>> mentioned before, happens about every second, but they only show these
>> long sync times during the mysterious 20-minute period. I've taken an
>> excerpt from one of these 20-minute pauses. You can see that for most GCs
>> the only time is in vmop and sync time is 0, but during these pauses the
>> 'sync' time takes up the majority of the time.
>>
>> [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>>
>> 116546.586: GenCollectForAllocation  [ 146    5   7 ]  [    0  0      0  1   50 ]  0
>> 116546.891: GenCollectForAllocation  [ 146    2   7 ]  [    0  0      0  2   50 ]  0
>> 116547.969: GenCollectForAllocation  [ 145    0   2 ]  [    0  0      0  2  290 ]  0
>> 116549.500: GenCollectForAllocation  [ 145    0   0 ]  [    0  0      0  2   67 ]  0
>> 116550.836: GenCollectForAllocation  [ 142    0   1 ]  [    0  0      0  1   82 ]  0
>> 116553.398: GenCollectForAllocation  [ 142    0   2 ]  [    0  0      0  2   76 ]  0
>> 116555.109: GenCollectForAllocation  [ 142    0   0 ]  [    0  0      0  2   84 ]  0
>> 116557.328: GenCollectForAllocation  [ 142    0   0 ]  [    0  0      0  2   64 ]  0
>> 116561.992: GenCollectForAllocation  [ 143    2   1 ]  [    0  0    523  2   76 ]  1
>> 116567.367: GenCollectForAllocation  [ 143    1   0 ]  [    1  0     39  2  104 ]  0
>> 116572.438: GenCollectForAllocation  [ 143    4   3 ]  [    0  0      0  2   85 ]  0
>> 116575.977: GenCollectForAllocation  [ 144   76   1 ]  [   24  0 154039  9  181 ]  0
>> 116731.336: GenCollectForAllocation  [ 353   41   5 ]  [    5  0      5  1  101 ]  0
>> 116732.328: GenCollectForAllocation  [ 354    5  16 ]  [ 2080  0   2115  1    0 ]  1
>> 116736.430: GenCollectForAllocation  [ 354    5   9 ]  [    0  0      0  1   81 ]  2
>> 116736.891: GenCollectForAllocation  [ 354    0   4 ]  [    0  0      0  4   88 ]  0
>> 116737.305: GenCollectForAllocation  [ 354    2   9 ]  [    0  0      0  1   80 ]  0
>> 116737.664: GenCollectForAllocation  [ 354    1   8 ]  [    0  0      0  2   65 ]  0
>> 116738.055: GenCollectForAllocation  [ 355    1   8 ]  [    0  0      0  1  106 ]  0
>> 116738.797: GenCollectForAllocation  [ 354    0   5 ]  [    0  0   2116  2  125 ]  0
>> 116741.523: GenCollectForAllocation  [ 353    1   0 ]  [    5  0    502  1  195 ]  0
>> 116743.219: GenCollectForAllocation  [ 352    1   5 ]  [    0  0      0  1    0 ]  0
>> 116743.719: GenCollectForAllocation  [ 352    1   7 ]  [    0  0      0  1   67 ]  0
>> 116744.266: GenCollectForAllocation  [ 352  271   0 ]  [   28  0 764563  4    0 ]  0
>> 117509.914: GenCollectForAllocation  [ 347    1   2 ]  [    0  0      0  2  166 ]  0
>> 117510.609: GenCollectForAllocation  [ 456   84   9 ]  [    8  0      8  2  103 ]  1
>> 117511.305: GenCollectForAllocation  [ 479    0   6 ]  [    0  0      0  7  199 ]  0
>> 117512.086: GenCollectForAllocation  [ 480    0   2 ]  [    0  0      0  2  192 ]  0
>> 117829.000: GenCollectForAllocation  [ 569    0   3 ]  [    0  0      0  2    0 ]  0
>> 117829.000: GenCollectForAllocation  [ 569    2   5 ]  [    0  0      0  0  128 ]  0
>> 117829.523: GenCollectForAllocation  [ 569    0   6 ]  [    0  0      0  2   84 ]  0
>> 117830.039: GenCollectForAllocation  [ 571    0   5 ]  [    0  0      0  2    0 ]  0
>> 117830.781: GenCollectForAllocation  [ 571    0   6 ]  [    0  0      0  6   72 ]  0
>> 117831.461: GenCollectForAllocation  [ 571    0   4 ]  [    0  0      0  1    0 ]  0
>> 117831.469: GenCollectForAllocation  [ 571    0   3 ]  [    0  0      0  0  113 ]  0
>>
>> From: Vitaly Davidovich
>> Date: Thursday, May 28, 2015 at 4:20 PM
>> To: Jason Goetz
>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>> young gen GC
>>
>> Jason,
>>
>> Not sure if you meant to reply just to me, but you did :)
>>
>> So I suspect the RUNNABLE you list is what jstack gives you, which is
>> slightly a lie since it'll show some threads blocked in native code as
>> RUNNABLE.
>>
>> The fact that you're on a VM is biasing me towards looking at that
>> angle. If there's a spike in runnable (from the kernel scheduler's
>> standpoint) threads and/or contention for resources, and it's driven by
>> the hypervisor, I wonder if there are any artifacts of that. I don't have
>> much experience running servers on VMs (only bare metal), so it's hard to
>> say. You may want to reply to the list again and see if anyone else has
>> more insight into this type of setup.
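>>
>> To make the jstack point concrete, here's a minimal standalone sketch
>> (nothing to do with your app - the class name and setup are made up):
>> a thread parked in a native accept() call still reports RUNNABLE.
>>
>> import java.net.ServerSocket;
>>
>> public class NativeRunnable {
>>     public static void main(String[] args) throws Exception {
>>         final ServerSocket server = new ServerSocket(0);
>>         Thread t = new Thread(new Runnable() {
>>             public void run() {
>>                 try {
>>                     // Blocks inside the OS-level accept() - native code.
>>                     server.accept();
>>                 } catch (Exception ignored) {
>>                 }
>>             }
>>         }, "blocked-in-native");
>>         t.start();
>>         Thread.sleep(1000);
>>         // Prints RUNNABLE: the Java-level thread state doesn't
>>         // distinguish "executing bytecode" from "stuck in native".
>>         // jstack reports the same for this thread.
>>         System.out.println(t.getName() + ": " + t.getState());
>>         System.exit(0);
>>     }
>> }
>>
>> So if a good chunk of your RUNNABLE threads are actually sitting in
>> native like this, the jstack counts don't say much about actual
>> scheduler pressure.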
>>
>> Also, Poonam asked for safepoint statistics for the VM ops that were
>> requested -- do you have those?
>>
>> On Thu, May 28, 2015 at 4:20 PM, Jason Goetz wrote:
>>
>>> I'm happy to answer whatever I can. Thanks for taking the time to help.
>>> It's running on a VM, not bare metal. The exact OS is Windows Server 2008.
>>> The database is running on another machine. There is a very large Lucene
>>> index on the same machine as the application, and commits to this index
>>> are frequent and often contended.
>>>
>>> From the thread dumps I took during these pauses (there are several that
>>> happen around minor GCs during these 20-minute periods) I can see the
>>> following stats:
>>>
>>> Dump 1:
>>> Threads: 147
>>> RUNNABLE: 42
>>> WAITING: 30
>>> TIMED_WAITING: 75
>>> BLOCKED: 0
>>>
>>> Dump 2:
>>> Threads: 259
>>> RUNNABLE: 143
>>> WAITING: 47
>>> TIMED_WAITING: 62
>>> BLOCKED: 7
>>>
>>> The only reason I believe the thread count is higher than usual on the
>>> second dump is that the dump follows a very long pause (69 seconds, all
>>> spent in sync time stopping threads for a safepoint), so I think several
>>> web requests accumulated during this pause and needed to be served.
>>>
>>> As far as Unsafe operations go, the only thing I see in thread dumps when
>>> I grep for Unsafe is Unsafe.park operations in threads that are
>>> TIMED_WAITING.
>>>
>>> As far as memory allocation, I do have some good profiling of that from
>>> the flight recordings that are taken, including a listing of allocations
>>> by thread. I haven't been able to see any abnormal allocations happening
>>> at the time of the pauses, and the total amount of memory being allocated
>>> is no different during these pauses. In fact, the amount of memory getting
>>> allocated (inside and outside TLABs) is less during these pauses, as I
>>> imagine the time that threads spend waiting for a safepoint takes time
>>> away from running code that allocates memory.
>>>
>>> From: Vitaly Davidovich
>>> Date: Thursday, May 28, 2015 at 12:06 PM
>>> To: Jason Goetz
>>>
>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>> young gen GC
>>>
>>> Thanks Jason. Is this bare metal Windows or virtualized? Of the
>>> 140-200 active, how many are runnable at the time of the stalls?
>>>
>>> Do you (or any libraries you use, that you know of) use Unsafe for big
>>> memcpy-style operations?
>>>
>>> When these spikes occur, how many runnable procs are there on the
>>> machine? Is there scheduling contention perhaps (with Tomcat?)?
>>>
>>> As for JNI, typically, Java threads in JNI won't stall threads from
>>> sync'ing on a safepoint.
>>>
>>> Sorry for the Spanish Inquisition, but it may help us figure this out or
>>> at least get a lead.
>>>
>>> On Thu, May 28, 2015 at 2:45 PM, Jason Goetz wrote:
>>>
>>>> Vitaly,
>>>>
>>>> We've seen 140-200 active threads during the time of the stalls, but
>>>> that's no different from any other time period. There are 12 CPUs
>>>> available to the JVM and there is 24G in the heap, 64G on the machine.
>>>> This is the only JVM running on the machine, which runs on a Windows
>>>> server, and Tomcat is the only application of note other than a few
>>>> monitoring tools (Zabbix, HP OpenView, VMware Tools), which I haven't
>>>> had the option of turning off.
>>>>
>>>> I'm not sure that JNI is running. We don't explicitly have any JNI
>>>> calls running, but I'm not sure whether any of the 3rd-party libraries
>>>> we use have JNI code that I'm unaware of. I haven't been able to figure
>>>> out how to identify if JNI calls are even running; the closest thing I
>>>> can think of is dumping the JVM's internal list of loaded native
>>>> libraries (sketch below).
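>>>>
>>>> This is a hack, not a supported API - it reads a JDK-internal field
>>>> that, as far as I can tell, exists in JDK 7/8 (it is gone in later
>>>> versions, so it's worth verifying against the exact JDK build), and it
>>>> would have to run inside the application itself (e.g. from a debug
>>>> endpoint) to report that JVM's libraries:
>>>>
>>>> import java.lang.reflect.Field;
>>>> import java.util.Vector;
>>>>
>>>> public class ListNativeLibs {
>>>>     public static void main(String[] args) throws Exception {
>>>>         // JDK 7/8 record the name of every JNI library loaded so far
>>>>         // in a private static field on ClassLoader.
>>>>         Field f = ClassLoader.class.getDeclaredField("loadedLibraryNames");
>>>>         f.setAccessible(true);
>>>>         @SuppressWarnings("unchecked")
>>>>         Vector<String> libs = (Vector<String>) f.get(null);
>>>>         synchronized (libs) {
>>>>             for (String lib : libs) {
>>>>                 System.out.println(lib);
>>>>             }
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> Anything a 3rd-party library pulled in via System.loadLibrary() should
>>>> show up in that list.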
>>>> We have taken several Java Flight Recordings around these
>>>> every-20-minute pauses, but haven't seen any patterns or unusual spikes
>>>> in disk I/O, thread contention, or any thread activity. There is no
>>>> swapping at all either.
>>>>
>>>> Any other information that I could provide in order to give a clearer
>>>> picture of the system?
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>> From: Vitaly Davidovich
>>>> Date: Thursday, May 28, 2015 at 11:17 AM
>>>> To: Jason Goetz
>>>> Cc: hotspot-gc-use
>>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>>> young gen GC
>>>>
>>>> Jason,
>>>>
>>>> How many Java threads are active when these stalls happen? How many
>>>> CPUs are available to the JVM? How much physical memory is on the
>>>> machine? Is your JVM the sole occupant of the machine, or do you have
>>>> noisy neighbors? You mentioned JNI - do you have a lot of JNI calls
>>>> around these times? Do you allocate and/or write to large arrays/memory
>>>> regions? Is there something different/interesting about these 20-minute
>>>> periods (e.g. workload increases, same time of day, more disk activity,
>>>> any paging/swap activity, etc.)?
>>>>
>>>> sent from my phone
>>>> On May 28, 2015 1:58 PM, "Jason Goetz" wrote:
>>>>
>>>>> We're consistently seeing a situation where threads take a few seconds
>>>>> to stop for a routine GC. For 20 straight minutes the GC will run right
>>>>> away (it runs about every second). But then, during a 20-minute period,
>>>>> the threads will take longer to stop for GC. See the GC output below.
>>>>>
>>>>> 2015-05-28T12:14:51.205-0500: 54796.811: Total time for which
>>>>> application threads were stopped: 0.1121233 seconds, Stopping threads
>>>>> took: 0.0000908 seconds
>>>>> 2015-05-28T12:15:00.331-0500: 54805.930: Total time for which
>>>>> application threads were stopped: 0.0019384 seconds, Stopping threads
>>>>> took: 0.0001106 seconds
>>>>> 2015-05-28T12:15:06.572-0500: 54812.174: [GC concurrent-mark-end,
>>>>> 28.4067370 secs]
>>>>> 2015-05-28T12:15:09.786-0500: 54815.395: [GC remark
>>>>> 2015-05-28T12:15:09.786-0500: 54815.396: [GC ref-proc, 0.0103603 secs],
>>>>> 0.0709271 secs]
>>>>> [Times: user=0.73 sys=0.00, real=0.08 secs]
>>>>> *2015-05-28T12:15:09.864-0500: 54815.466: Total time for which
>>>>> application threads were stopped: 3.2916224 seconds, Stopping threads
>>>>> took: 3.2188032 seconds*
>>>>> 2015-05-28T12:15:09.864-0500: 54815.467: [GC cleanup 20G->20G(30G),
>>>>> 0.0451098 secs]
>>>>> [Times: user=0.61 sys=0.00, real=0.05 secs]
>>>>> 2015-05-28T12:15:09.910-0500: 54815.512: Total time for which
>>>>> application threads were stopped: 0.0459803 seconds, Stopping threads
>>>>> took: 0.0001950 seconds
>>>>>
>>>>> Turning on safepoint logging reveals that these stopping-threads times
>>>>> are taken up by safepoint 'sync' time. Taking thread dumps every second
>>>>> around these pauses fails to show anything of note happening during this
>>>>> time, but it's my understanding that native code won't necessarily show
>>>>> up in thread dumps anyway, given that those threads may have exited
>>>>> native code by the time the JVM reaches a safepoint.
>>>>>
>>>>> Enabling PrintJNIGCStalls fails to show any logging around the 3-second
>>>>> pause seen above. I highly suspected JNI but was surprised that I didn't
>>>>> see any logging about JNI Weak References after turning that option on.
>>>>> Any ideas for what I can try next? We're using JDK 7u80.
>>>>> Here are the rest of my JVM settings:
>>>>>
>>>>> DisableExplicitGC true
>>>>> FlightRecorder true
>>>>> GCLogFileSize 52428800
>>>>> ManagementServer true
>>>>> MinHeapSize 25769803776
>>>>> MaxHeapSize 25769803776
>>>>> MaxPermSize 536870912
>>>>> NumberOfGCLogFiles 10
>>>>> PrintAdaptiveSizePolicy true
>>>>> PrintGC true
>>>>> PrintGCApplicationStoppedTime true
>>>>> PrintGCCause true
>>>>> PrintGCDateStamps true
>>>>> PrintGCDetails true
>>>>> PrintGCTimeStamps true
>>>>> PrintSafepointStatistics true
>>>>> PrintSafepointStatisticsCount 1
>>>>> PrintTenuringDistribution true
>>>>> ReservedCodeCacheSize 268435456
>>>>> SafepointTimeout true
>>>>> SafepointTimeoutDelay 4000
>>>>> ThreadStackSize 4096
>>>>> UnlockCommercialFeatures true
>>>>> UseBiasedLocking false
>>>>> UseGCLogFileRotation false
>>>>> UseG1GC true
>>>>> PrintJNIGCStalls true
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From ysr1729 at gmail.com  Mon Jun  1 14:00:51 2015
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Mon, 1 Jun 2015 07:00:51 -0700
Subject: JVM taking a few seconds to reach a safepoint for routine young gen GC
In-Reply-To: References: Message-ID:

Hi Gustav --

What you stated is correct. However, memory operations outside native mode
can also be subject to page faults, which can in turn affect the safepoint
sync time of those non-native threads. For one example, witness the recent
discussion of perf data logging (to non-tmpfs storage in that case).

-- Ramki

ysr1729

> On May 30, 2015, at 10:05, Gustav Åkesson wrote:
>
> Hi,
>
> I thought that threads (blocked) in native were not a problem for
> safe-pointing? Meaning that these threads could continue to execute/block
> during an attempt to reach a safepoint, but are denied return to Java-land
> until safe-pointing is over.
>
> Best Regards,
> Gustav Åkesson
>
> On 30 May 2015 04:14, "Srinivas Ramakrishna" wrote:
>> Hi Jason --
>>
>> You mentioned a Lucene indexer on the same box. Can you check for
>> correlation between the indexing activity, paging behavior and the
>> incidence of the long safepoints?
>>
>> -- ramki
>>
>> ysr1729
>>
>>> On May 29, 2015, at 09:51, Jason Goetz wrote:
>>>
>>> Oops, I did not intend to remove this from the list. Re-added.
>>>
>>> I'll take a look at how many RUNNABLE threads are actually blocked in
>>> native code. I'll also look at vSphere to see if I can see anything
>>> unusual around resource contention.
>>>
>>> I've grepped the safepoint logs for GenCollectForAllocation, which, as I
>>> mentioned before, happens about every second, but they only show these
>>> long sync times during the mysterious 20-minute period. I've taken an
>>> excerpt from one of these 20-minute pauses.
>>> You can see that for most GCs the only time is in vmop and sync time is
>>> 0, but during these pauses the 'sync' time takes up the majority of the
>>> time.
>>>
>>> [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>>>
>>> 116546.586: GenCollectForAllocation  [ 146    5   7 ]  [    0  0      0  1   50 ]  0
>>> 116546.891: GenCollectForAllocation  [ 146    2   7 ]  [    0  0      0  2   50 ]  0
>>> 116547.969: GenCollectForAllocation  [ 145    0   2 ]  [    0  0      0  2  290 ]  0
>>> 116549.500: GenCollectForAllocation  [ 145    0   0 ]  [    0  0      0  2   67 ]  0
>>> 116550.836: GenCollectForAllocation  [ 142    0   1 ]  [    0  0      0  1   82 ]  0
>>> 116553.398: GenCollectForAllocation  [ 142    0   2 ]  [    0  0      0  2   76 ]  0
>>> 116555.109: GenCollectForAllocation  [ 142    0   0 ]  [    0  0      0  2   84 ]  0
>>> 116557.328: GenCollectForAllocation  [ 142    0   0 ]  [    0  0      0  2   64 ]  0
>>> 116561.992: GenCollectForAllocation  [ 143    2   1 ]  [    0  0    523  2   76 ]  1
>>> 116567.367: GenCollectForAllocation  [ 143    1   0 ]  [    1  0     39  2  104 ]  0
>>> 116572.438: GenCollectForAllocation  [ 143    4   3 ]  [    0  0      0  2   85 ]  0
>>> 116575.977: GenCollectForAllocation  [ 144   76   1 ]  [   24  0 154039  9  181 ]  0
>>> 116731.336: GenCollectForAllocation  [ 353   41   5 ]  [    5  0      5  1  101 ]  0
>>> 116732.328: GenCollectForAllocation  [ 354    5  16 ]  [ 2080  0   2115  1    0 ]  1
>>> 116736.430: GenCollectForAllocation  [ 354    5   9 ]  [    0  0      0  1   81 ]  2
>>> 116736.891: GenCollectForAllocation  [ 354    0   4 ]  [    0  0      0  4   88 ]  0
>>> 116737.305: GenCollectForAllocation  [ 354    2   9 ]  [    0  0      0  1   80 ]  0
>>> 116737.664: GenCollectForAllocation  [ 354    1   8 ]  [    0  0      0  2   65 ]  0
>>> 116738.055: GenCollectForAllocation  [ 355    1   8 ]  [    0  0      0  1  106 ]  0
>>> 116738.797: GenCollectForAllocation  [ 354    0   5 ]  [    0  0   2116  2  125 ]  0
>>> 116741.523: GenCollectForAllocation  [ 353    1   0 ]  [    5  0    502  1  195 ]  0
>>> 116743.219: GenCollectForAllocation  [ 352    1   5 ]  [    0  0      0  1    0 ]  0
>>> 116743.719: GenCollectForAllocation  [ 352    1   7 ]  [    0  0      0  1   67 ]  0
>>> 116744.266: GenCollectForAllocation  [ 352  271   0 ]  [   28  0 764563  4    0 ]  0
>>> 117509.914: GenCollectForAllocation  [ 347    1   2 ]  [    0  0      0  2  166 ]  0
>>> 117510.609: GenCollectForAllocation  [ 456   84   9 ]  [    8  0      8  2  103 ]  1
>>> 117511.305: GenCollectForAllocation  [ 479    0   6 ]  [    0  0      0  7  199 ]  0
>>> 117512.086: GenCollectForAllocation  [ 480    0   2 ]  [    0  0      0  2  192 ]  0
>>> 117829.000: GenCollectForAllocation  [ 569    0   3 ]  [    0  0      0  2    0 ]  0
>>> 117829.000: GenCollectForAllocation  [ 569    2   5 ]  [    0  0      0  0  128 ]  0
>>> 117829.523: GenCollectForAllocation  [ 569    0   6 ]  [    0  0      0  2   84 ]  0
>>> 117830.039: GenCollectForAllocation  [ 571    0   5 ]  [    0  0      0  2    0 ]  0
>>> 117830.781: GenCollectForAllocation  [ 571    0   6 ]  [    0  0      0  6   72 ]  0
>>> 117831.461: GenCollectForAllocation  [ 571    0   4 ]  [    0  0      0  1    0 ]  0
>>> 117831.469: GenCollectForAllocation  [ 571    0   3 ]  [    0  0      0  0  113 ]  0
>>>
>>> From: Vitaly Davidovich
>>> Date: Thursday, May 28, 2015 at 4:20 PM
>>> To: Jason Goetz
>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>> young gen GC
>>>
>>> Jason,
>>>
>>> Not sure if you meant to reply just to me, but you did :)
>>>
>>> So I suspect the RUNNABLE you list is what jstack gives you, which is
>>> slightly a lie since it'll show some threads blocked in native code as
>>> RUNNABLE.
>>>
>>> The fact that you're on a VM is biasing me towards looking at that
>>> angle. If there's a spike in runnable (from the kernel scheduler's
>>> standpoint) threads and/or contention for resources, and it's driven by
>>> the hypervisor, I wonder if there are any artifacts of that. I don't have
>>> much experience running servers on VMs (only bare metal), so it's hard to
>>> say.
>>> You may want to reply to the list again and see if anyone else has more
>>> insight into this type of setup.
>>>
>>> Also, Poonam asked for safepoint statistics for the VM ops that were
>>> requested -- do you have those?
>>>
>>>> On Thu, May 28, 2015 at 4:20 PM, Jason Goetz wrote:
>>>> I'm happy to answer whatever I can. Thanks for taking the time to help.
>>>> It's running on a VM, not bare metal. The exact OS is Windows Server
>>>> 2008. The database is running on another machine. There is a very large
>>>> Lucene index on the same machine as the application, and commits to this
>>>> index are frequent and often contended.
>>>>
>>>> From the thread dumps I took during these pauses (there are several that
>>>> happen around minor GCs during these 20-minute periods) I can see the
>>>> following stats:
>>>>
>>>> Dump 1:
>>>> Threads: 147
>>>> RUNNABLE: 42
>>>> WAITING: 30
>>>> TIMED_WAITING: 75
>>>> BLOCKED: 0
>>>>
>>>> Dump 2:
>>>> Threads: 259
>>>> RUNNABLE: 143
>>>> WAITING: 47
>>>> TIMED_WAITING: 62
>>>> BLOCKED: 7
>>>>
>>>> The only reason I believe the thread count is higher than usual on the
>>>> second dump is that the dump follows a very long pause (69 seconds, all
>>>> spent in sync time stopping threads for a safepoint), so I think several
>>>> web requests accumulated during this pause and needed to be served.
>>>>
>>>> As far as Unsafe operations go, the only thing I see in thread dumps when
>>>> I grep for Unsafe is Unsafe.park operations in threads that are
>>>> TIMED_WAITING.
>>>>
>>>> As far as memory allocation, I do have some good profiling of that from
>>>> the flight recordings that are taken, including a listing of allocations
>>>> by thread. I haven't been able to see any abnormal allocations happening
>>>> at the time of the pauses, and the total amount of memory being allocated
>>>> is no different during these pauses. In fact, the amount of memory getting
>>>> allocated (inside and outside TLABs) is less during these pauses, as I
>>>> imagine the time that threads spend waiting for a safepoint takes time
>>>> away from running code that allocates memory.
>>>>
>>>> From: Vitaly Davidovich
>>>> Date: Thursday, May 28, 2015 at 12:06 PM
>>>> To: Jason Goetz
>>>>
>>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>>> young gen GC
>>>>
>>>> Thanks Jason. Is this bare metal Windows or virtualized? Of the
>>>> 140-200 active, how many are runnable at the time of the stalls?
>>>>
>>>> Do you (or any libraries you use, that you know of) use Unsafe for big
>>>> memcpy-style operations?
>>>>
>>>> When these spikes occur, how many runnable procs are there on the
>>>> machine? Is there scheduling contention perhaps (with Tomcat?)?
>>>>
>>>> As for JNI, typically, Java threads in JNI won't stall threads from
>>>> sync'ing on a safepoint.
>>>>
>>>> Sorry for the Spanish Inquisition, but it may help us figure this out or
>>>> at least get a lead.
>>>>
>>>>> On Thu, May 28, 2015 at 2:45 PM, Jason Goetz wrote:
>>>>> Vitaly,
>>>>>
>>>>> We've seen 140-200 active threads during the time of the stalls, but
>>>>> that's no different from any other time period. There are 12 CPUs
>>>>> available to the JVM and there is 24G in the heap, 64G on the machine.
>>>>> This is the only JVM running on the machine, which runs on a Windows
>>>>> server, and Tomcat is the only application of note other than a few
>>>>> monitoring tools (Zabbix, HP OpenView, VMware Tools), which I haven't
>>>>> had the option of turning off.
>>>>>
>>>>> I'm not sure that JNI is running. We don't explicitly have any JNI
>>>>> calls running, but I'm not sure whether any of the 3rd-party libraries
>>>>> we use have JNI code that I'm unaware of.
>>>>> I haven't been able to figure out how to identify if JNI calls are even
>>>>> running. We have taken several Java Flight Recordings around these
>>>>> every-20-minute pauses, but haven't seen any patterns or unusual spikes
>>>>> in disk I/O, thread contention, or any thread activity. There is no
>>>>> swapping at all either.
>>>>>
>>>>> Any other information that I could provide in order to give a clearer
>>>>> picture of the system?
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>>
>>>>> From: Vitaly Davidovich
>>>>> Date: Thursday, May 28, 2015 at 11:17 AM
>>>>> To: Jason Goetz
>>>>> Cc: hotspot-gc-use
>>>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>>>> young gen GC
>>>>>
>>>>> Jason,
>>>>>
>>>>> How many Java threads are active when these stalls happen? How many
>>>>> CPUs are available to the JVM? How much physical memory is on the
>>>>> machine? Is your JVM the sole occupant of the machine, or do you have
>>>>> noisy neighbors? You mentioned JNI - do you have a lot of JNI calls
>>>>> around these times? Do you allocate and/or write to large arrays/memory
>>>>> regions? Is there something different/interesting about these 20-minute
>>>>> periods (e.g. workload increases, same time of day, more disk activity,
>>>>> any paging/swap activity, etc.)?
>>>>>
>>>>> sent from my phone
>>>>>
>>>>>> On May 28, 2015 1:58 PM, "Jason Goetz" wrote:
>>>>>> We're consistently seeing a situation where threads take a few seconds
>>>>>> to stop for a routine GC. For 20 straight minutes the GC will run right
>>>>>> away (it runs about every second). But then, during a 20-minute period,
>>>>>> the threads will take longer to stop for GC. See the GC output below.
>>>>>>
>>>>>> 2015-05-28T12:14:51.205-0500: 54796.811: Total time for which
>>>>>> application threads were stopped: 0.1121233 seconds, Stopping threads
>>>>>> took: 0.0000908 seconds
>>>>>> 2015-05-28T12:15:00.331-0500: 54805.930: Total time for which
>>>>>> application threads were stopped: 0.0019384 seconds, Stopping threads
>>>>>> took: 0.0001106 seconds
>>>>>> 2015-05-28T12:15:06.572-0500: 54812.174: [GC concurrent-mark-end,
>>>>>> 28.4067370 secs]
>>>>>> 2015-05-28T12:15:09.786-0500: 54815.395: [GC remark
>>>>>> 2015-05-28T12:15:09.786-0500: 54815.396: [GC ref-proc, 0.0103603 secs],
>>>>>> 0.0709271 secs]
>>>>>> [Times: user=0.73 sys=0.00, real=0.08 secs]
>>>>>> 2015-05-28T12:15:09.864-0500: 54815.466: Total time for which
>>>>>> application threads were stopped: 3.2916224 seconds, Stopping threads
>>>>>> took: 3.2188032 seconds
>>>>>> 2015-05-28T12:15:09.864-0500: 54815.467: [GC cleanup 20G->20G(30G),
>>>>>> 0.0451098 secs]
>>>>>> [Times: user=0.61 sys=0.00, real=0.05 secs]
>>>>>> 2015-05-28T12:15:09.910-0500: 54815.512: Total time for which
>>>>>> application threads were stopped: 0.0459803 seconds, Stopping threads
>>>>>> took: 0.0001950 seconds
>>>>>>
>>>>>> Turning on safepoint logging reveals that these stopping-threads times
>>>>>> are taken up by safepoint 'sync' time. Taking thread dumps every second
>>>>>> around these pauses fails to show anything of note happening during
>>>>>> this time, but it's my understanding that native code won't necessarily
>>>>>> show up in thread dumps anyway, given that those threads may have
>>>>>> exited native code by the time the JVM reaches a safepoint.
>>>>>>
>>>>>> Enabling PrintJNIGCStalls fails to show any logging around the
>>>>>> 3-second pause seen above. I highly suspected JNI but was surprised
>>>>>> that I didn't see any logging about JNI Weak References after turning
>>>>>> that option on. Any ideas for what I can try next? We're using JDK
>>>>>> 7u80.
>>>>>> Here are the rest of my JVM settings:
>>>>>>
>>>>>> DisableExplicitGC true
>>>>>> FlightRecorder true
>>>>>> GCLogFileSize 52428800
>>>>>> ManagementServer true
>>>>>> MinHeapSize 25769803776
>>>>>> MaxHeapSize 25769803776
>>>>>> MaxPermSize 536870912
>>>>>> NumberOfGCLogFiles 10
>>>>>> PrintAdaptiveSizePolicy true
>>>>>> PrintGC true
>>>>>> PrintGCApplicationStoppedTime true
>>>>>> PrintGCCause true
>>>>>> PrintGCDateStamps true
>>>>>> PrintGCDetails true
>>>>>> PrintGCTimeStamps true
>>>>>> PrintSafepointStatistics true
>>>>>> PrintSafepointStatisticsCount 1
>>>>>> PrintTenuringDistribution true
>>>>>> ReservedCodeCacheSize 268435456
>>>>>> SafepointTimeout true
>>>>>> SafepointTimeoutDelay 4000
>>>>>> ThreadStackSize 4096
>>>>>> UnlockCommercialFeatures true
>>>>>> UseBiasedLocking false
>>>>>> UseGCLogFileRotation false
>>>>>> UseG1GC true
>>>>>> PrintJNIGCStalls true
>>>>>>
>>>>>> _______________________________________________
>>>>>> hotspot-gc-use mailing list
>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From yiyeguhu at gmail.com  Wed Jun 24 00:42:19 2015
From: yiyeguhu at gmail.com (Tao Mao)
Date: Tue, 23 Jun 2015 17:42:19 -0700
Subject: Long Reference Processing Time
In-Reply-To: References: <555D0C29.8000301@oracle.com>
	<913945159.3879136.1432161687674.JavaMail.yahoo@mail.yahoo.com>
Message-ID:

Or, give Java Mission Control a try!

-Tao

On Wed, May 20, 2015 at 3:52 PM, Simone Bordet wrote:

> Hi,
>
> On Thu, May 21, 2015 at 12:41 AM, Joy Xiong wrote:
> > Are there other ways to do this? It's a prod environment and a heap
> > dump would be too intrusive...
>
> We used our solution in production too, by enabling it for a few minutes
> to collect data (via JMX) and then disabling it until the next restart
> (also via JMX), where it was removed.
> Required 2 restarts: one to add the instrumentation, and one to remove it.
>
> --
> Simone Bordet
> http://bordet.blogspot.com
> ---
> Finally, no matter how good the architecture and design are,
> to deliver bug-free software with optimal performance and reliability,
> the implementation technique must be flawless.   Victoria Livschitz
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use