JVM taking a few seconds to reach a safepoint for routine young gen GC

Srinivas Ramakrishna ysr1729 at gmail.com
Sat May 30 02:14:10 UTC 2015


Hi Jason --

You mentioned a Lucene indexer on the same box. Can you check for correlation between the indexing activity, paging behavior, and the incidence of the long safepoints?

-- ramki

ysr1729

> On May 29, 2015, at 09:51, Jason Goetz <jason.goetz at gmail.com> wrote:
> 
> Oops, I did not intend to remove this from the list. Re-added.
> 
> I’ll take a look at how many RUNNABLE threads are actually blocked in native code. I’ll also look at VSphere to see if I can see anything unusual around resource contention.
> 
> I’ve grepped the safepoint logs for GenCollectForAllocation, which, as I mentioned before, happens about every second, but the long sync times show up only during the mysterious 20-minute period. I’ve taken an excerpt from one of these 20-minute stretches. You can see that for most GCs the only time spent is in vmop and the sync time is 0, but during these pauses the ’sync’ time takes up the majority of the time.
> 
> [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 
> 116546.586: GenCollectForAllocation          [     146          5              7    ]      [     0     0     0     1    50    ]  0
> 116546.891: GenCollectForAllocation          [     146          2              7    ]      [     0     0     0     2    50    ]  0
> 116547.969: GenCollectForAllocation          [     145          0              2    ]      [     0     0     0     2   290    ]  0
> 116549.500: GenCollectForAllocation          [     145          0              0    ]      [     0     0     0     2    67    ]  0
> 116550.836: GenCollectForAllocation          [     142          0              1    ]      [     0     0     0     1    82    ]  0
> 116553.398: GenCollectForAllocation          [     142          0              2    ]      [     0     0     0     2    76    ]  0
> 116555.109: GenCollectForAllocation          [     142          0              0    ]      [     0     0     0     2    84    ]  0
> 116557.328: GenCollectForAllocation          [     142          0              0    ]      [     0     0     0     2    64    ]  0
> 116561.992: GenCollectForAllocation          [     143          2              1    ]      [     0     0   523     2    76    ]  1
> 116567.367: GenCollectForAllocation          [     143          1              0    ]      [     1     0    39     2   104    ]  0
> 116572.438: GenCollectForAllocation          [     143          4              3    ]      [     0     0     0     2    85    ]  0
> 116575.977: GenCollectForAllocation          [     144         76              1    ]      [    24     0 154039     9   181    ]  0
> 116731.336: GenCollectForAllocation          [     353         41              5    ]      [     5     0     5     1   101    ]  0
> 116732.328: GenCollectForAllocation          [     354          5             16    ]      [  2080     0  2115     1     0    ]  1
> 116736.430: GenCollectForAllocation          [     354          5              9    ]      [     0     0     0     1    81    ]  2
> 116736.891: GenCollectForAllocation          [     354          0              4    ]      [     0     0     0     4    88    ]  0
> 116737.305: GenCollectForAllocation          [     354          2              9    ]      [     0     0     0     1    80    ]  0
> 116737.664: GenCollectForAllocation          [     354          1              8    ]      [     0     0     0     2    65    ]  0
> 116738.055: GenCollectForAllocation          [     355          1              8    ]      [     0     0     0     1   106    ]  0
> 116738.797: GenCollectForAllocation          [     354          0              5    ]      [     0     0  2116     2   125    ]  0
> 116741.523: GenCollectForAllocation          [     353          1              0    ]      [     5     0   502     1   195    ]  0
> 116743.219: GenCollectForAllocation          [     352          1              5    ]      [     0     0     0     1     0    ]  0
> 116743.719: GenCollectForAllocation          [     352          1              7    ]      [     0     0     0     1    67    ]  0
> 116744.266: GenCollectForAllocation          [     352        271              0    ]      [    28     0 764563     4     0    ]  0
> 117509.914: GenCollectForAllocation          [     347          1              2    ]      [     0     0     0     2   166    ]  0
> 117510.609: GenCollectForAllocation          [     456         84              9    ]      [     8     0     8     2   103    ]  1
> 117511.305: GenCollectForAllocation          [     479          0              6    ]      [     0     0     0     7   199    ]  0
> 117512.086: GenCollectForAllocation          [     480          0              2    ]      [     0     0     0     2   192    ]  0
> 117829.000: GenCollectForAllocation          [     569          0              3    ]      [     0     0     0     2     0    ]  0
> 117829.000: GenCollectForAllocation          [     569          2              5    ]      [     0     0     0     0   128    ]  0
> 117829.523: GenCollectForAllocation          [     569          0              6    ]      [     0     0     0     2    84    ]  0
> 117830.039: GenCollectForAllocation          [     571          0              5    ]      [     0     0     0     2     0    ]  0
> 117830.781: GenCollectForAllocation          [     571          0              6    ]      [     0     0     0     6    72    ]  0
> 117831.461: GenCollectForAllocation          [     571          0              4    ]      [     0     0     0     1     0    ]  0
> 117831.469: GenCollectForAllocation          [     571          0              3    ]      [     0     0     0     0   113    ]  0
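For what it's worth, excerpts like the one above can be scanned mechanically. A minimal sketch in Python, assuming the PrintSafepointStatistics column layout shown (the group names and the 100 ms threshold are illustrative; lines where very wide values run into the neighboring column would need the columns re-separated first):

```python
import re

# Matches one -XX:+PrintSafepointStatistics line of the shape shown above:
#   <ts>: <vm_op> [ total initially_running wait_to_block ] [ spin block sync cleanup vmop ] page_trap_count
LINE = re.compile(
    r"^\s*(?P<ts>\d+\.\d+):\s+(?P<op>\S+)\s+"
    r"\[\s*(?P<total>\d+)\s+(?P<running>\d+)\s+(?P<waiting>\d+)\s*\]\s+"
    r"\[\s*(?P<spin>\d+)\s+(?P<block>\d+)\s+(?P<sync>\d+)\s+(?P<cleanup>\d+)\s+(?P<vmop>\d+)\s*\]"
)

def long_syncs(lines, threshold_ms=100):
    """Yield (timestamp, vm_op, sync_ms) for safepoints whose sync time exceeds threshold_ms."""
    for line in lines:
        m = LINE.match(line)
        if m and int(m.group("sync")) > threshold_ms:
            yield float(m.group("ts")), m.group("op"), int(m.group("sync"))

sample = [
    "116557.328: GenCollectForAllocation [ 142 0 0 ] [ 0 0 0 2 64 ] 0",
    "116561.992: GenCollectForAllocation [ 143 2 1 ] [ 0 0 523 2 76 ] 1",
]
print(list(long_syncs(sample)))  # only the safepoint with the 523 ms sync
```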
> 
> From: Vitaly Davidovich <vitalyd at gmail.com>
> Date: Thursday, May 28, 2015 at 4:20 PM
> To: Jason Goetz <jason.goetz at gmail.com>
> Subject: Re: JVM taking a few seconds to reach a safepoint for routine young gen GC
> 
> Jason,
> 
> Not sure if you meant to reply just to me, but you did :)
> 
> So I suspect the RUNNABLE count you list is what jstack gives you, which is slightly a lie, since it'll show some threads blocked in native code as RUNNABLE.
> 
> The fact that you're on a VM is biasing me towards looking at that angle.  If there's a spike in runnable (from the kernel scheduler's standpoint) threads and/or contention for resources, and it's driven by the hypervisor, I wonder whether there are any artifacts of that.  I don't have much experience running servers on VMs (only bare metal), so it's hard to say.  You may want to reply to the list again and see if anyone else has more insight into this type of setup.
> 
> Also, Poonam asked for safepoint statistics for the vm ops that were requested -- do you have those?
> 
>> On Thu, May 28, 2015 at 4:20 PM, Jason Goetz <jason.goetz at gmail.com> wrote:
>> I’m happy to answer whatever I can. Thanks for taking the time to help. It’s running on a VM, not bare metal. The exact OS is Windows Server 2008. The database is running on another machine. There is a very large Lucene index on the same machine as the application, and commits to this index are frequent and often contended.
>> 
>> From the thread dumps I took during these pauses (there are several that happen around minor GCs during these 20-minute periods) I can see the following stats:
>> 
>> Dump 1:
>> 	Threads: 147
>> 	RUNNABLE: 42
>> 	WAITING: 30
>> 	TIMED_WAITING: 75
>> 	BLOCKED: 0
>> 
>> Dump 2:
>> 	 Threads: 259
>> 	RUNNABLE: 143
>> 	WAITING: 47
>> 	TIMED_WAITING: 62
>> 	BLOCKED: 7
>> 
>> The only reason I can see for the thread count being higher than usual in the second dump is that it follows a very long pause (69 seconds, all spent in sync time stopping threads for a safepoint), so I think several web requests queued up during the pause and then needed to be served.
>> 
>> As far as Unsafe operations go, the only things I see in thread dumps when I grep for Unsafe are Unsafe.park calls in threads that are TIMED_WAITING.
>> 
>> As far as memory allocation goes, I do have some good profiling of that from the flight recordings, including a listing of allocations by thread. I haven’t been able to see any abnormal allocations during the pauses, and the total amount of memory being allocated is no different. In fact, the amount of memory allocated (inside and outside TLABs) is lower during these pauses, as I imagine the time threads spend waiting for a safepoint takes time away from running code that allocates memory.
>> 
>> From: Vitaly Davidovich <vitalyd at gmail.com>
>> Date: Thursday, May 28, 2015 at 12:06 PM
>> To: Jason Goetz <jason.goetz at gmail.com>
>> 
>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine young gen GC
>> 
>> Thanks Jason.  Is this bare metal Windows or virtualized?  Of the 140-200 active, how many are runnable at the time of the stalls?
>> 
>> Do you (or any used libs that you know of) use Unsafe for big memcpy style operations?
>> 
>> When these spikes occur, how many runnable procs are there on the machine? Is there scheduling contention perhaps (with Tomcat?)?
>> 
>> As for JNI: typically, java threads executing JNI code won't stall a safepoint sync, since they're treated as already stopped.
>> 
>> Sorry for the Spanish Inquisition, but it may help us figure this out or at least get a lead.
>> 
>>> On Thu, May 28, 2015 at 2:45 PM, Jason Goetz <jason.goetz at gmail.com> wrote:
>>> Vitaly,
>>> 
>>> We’ve seen 140-200 active threads during the time of the stalls, but that’s no different from any other time period. There are 12 CPUs available to the JVM, the heap is 24G, and the machine has 64G. This is the only JVM running on the machine, which is a Windows server, and Tomcat is the only application of note other than a few monitoring tools (Zabbix, HP OpenView, VMware Tools), which I haven’t had the option of turning off.
>>> 
>>> I’m not sure that JNI is running. We don’t explicitly have any JNI calls running, but I’m not sure about whether any of the 3rd-party libraries we use have JNI code that I’m unaware of. I haven’t been able to figure out how to identify if JNI calls are even running. We have taken several Java Flight Recordings around these every-20-minute pauses, but haven’t seen any patterns or unusual spikes in disk I/O, thread contention, or any thread activity. There is no swapping at all either.
>>> 
>>> Any other information that I could provide in order to give a clearer picture of the system?
>>> 
>>> Thanks,
>>> Jason
>>>  
>>> From: Vitaly Davidovich <vitalyd at gmail.com>
>>> Date: Thursday, May 28, 2015 at 11:17 AM
>>> To: Jason Goetz <jason.goetz at gmail.com>
>>> Cc: hotspot-gc-use <hotspot-gc-use at openjdk.java.net>
>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine young gen GC
>>> 
>>> Jason,
>>> 
>>> How many java threads are active when these stalls happen? How many CPUs are available to the jvm? How much physical memory on the machine? Is your jvm sole occupant of the machine or do you have noisy neighbors? You mentioned JNI - do you have a lot of JNI calls around these times? Do you allocate and/or write to large arrays/memory regions? Is there something different/interesting about these 20 min periods (e.g. workload increases, same time of day, more disk activity, any paging/swap activity, etc).
>>> 
>>> sent from my phone
>>> 
>>>> On May 28, 2015 1:58 PM, "Jason Goetz" <jason.goetz at gmail.com> wrote:
>>>> We're consistently seeing a situation where threads take a few seconds to stop for a routine GC. For 20 straight minutes the GC will run right away (it runs about every second). But then, during a 20-minute period, the threads will take longer to stop for GC. See the GC output below.
>>>> 
>>>> 2015-05-28T12:14:51.205-0500: 54796.811: Total time for which application threads were stopped: 0.1121233 seconds, Stopping threads took: 0.0000908 seconds
>>>> 2015-05-28T12:15:00.331-0500: 54805.930: Total time for which application threads were stopped: 0.0019384 seconds, Stopping threads took: 0.0001106 seconds
>>>> 2015-05-28T12:15:06.572-0500: 54812.174: [GC concurrent-mark-end, 28.4067370 secs]
>>>> 2015-05-28T12:15:09.786-0500: 54815.395: [GC remark 2015-05-28T12:15:09.786-0500: 54815.396: [GC ref-proc, 0.0103603 secs], 0.0709271 secs]
>>>>  [Times: user=0.73 sys=0.00, real=0.08 secs] 
>>>> 2015-05-28T12:15:09.864-0500: 54815.466: Total time for which application threads were stopped: 3.2916224 seconds, Stopping threads took: 3.2188032 seconds
>>>> 2015-05-28T12:15:09.864-0500: 54815.467: [GC cleanup 20G->20G(30G), 0.0451098 secs]
>>>>  [Times: user=0.61 sys=0.00, real=0.05 secs] 
>>>> 2015-05-28T12:15:09.910-0500: 54815.512: Total time for which application threads were stopped: 0.0459803 seconds, Stopping threads took: 0.0001950 seconds
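Stopped-time lines like these can be filtered for high time-to-safepoint mechanically. A minimal sketch in Python, assuming the PrintGCApplicationStoppedTime format shown above (the 1-second threshold is arbitrary):

```python
import re

# Matches -XX:+PrintGCApplicationStoppedTime lines and pulls out the total
# stopped time and the time-to-safepoint ("Stopping threads took").
STOPPED = re.compile(
    r"stopped: (?P<stopped>[\d.]+) seconds, "
    r"Stopping threads took: (?P<ttsp>[\d.]+) seconds"
)

def slow_to_safepoint(lines, ttsp_threshold=1.0):
    """Return (stopped_secs, ttsp_secs) pairs where time-to-safepoint exceeds the threshold."""
    hits = []
    for line in lines:
        m = STOPPED.search(line)
        if m and float(m.group("ttsp")) > ttsp_threshold:
            hits.append((float(m.group("stopped")), float(m.group("ttsp"))))
    return hits

log = [
    "54805.930: Total time for which application threads were stopped: 0.0019384 seconds, Stopping threads took: 0.0001106 seconds",
    "54815.466: Total time for which application threads were stopped: 3.2916224 seconds, Stopping threads took: 3.2188032 seconds",
]
print(slow_to_safepoint(log))  # only the 3.29 s stop, 3.22 s of it spent reaching the safepoint
```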
>>>> 
>>>> Turning on safepoint logging reveals that these stopping-threads times are taken up by safepoint ‘sync’ time. Taking thread dumps every second around these pauses fails to show anything of note, but it’s my understanding that the offending native code won’t necessarily show up in thread dumps anyway, since the dump itself can only be taken once the JVM reaches the safepoint.
>>>> 
>>>> Enabling PrintJNIGCStalls failed to show any logging around the 3-second pause seen above. I highly suspected JNI but was surprised that I didn’t see any logging about JNI Weak References after turning that option on. Any ideas for what I can try next? We’re using JDK 7u80. Here are the rest of my JVM settings:
>>>> 
>>>> DisableExplicitGC              true
>>>> FlightRecorder                 true
>>>> GCLogFileSize                  52428800
>>>> ManagementServer               true
>>>> MinHeapSize                    25769803776
>>>> MaxHeapSize                    25769803776
>>>> MaxPermSize                    536870912
>>>> NumberOfGCLogFiles             10
>>>> PrintAdaptiveSizePolicy        true
>>>> PrintGC                        true
>>>> PrintGCApplicationStoppedTime  true
>>>> PrintGCCause                   true
>>>> PrintGCDateStamps              true
>>>> PrintGCDetails                 true
>>>> PrintGCTimeStamps              true
>>>> PrintSafepointStatistics       true
>>>> PrintSafepointStatisticsCount  1
>>>> PrintTenuringDistribution      true
>>>> ReservedCodeCacheSize          268435456
>>>> SafepointTimeout               true
>>>> SafepointTimeoutDelay          4000
>>>> ThreadStackSize                4096
>>>> UnlockCommercialFeatures       true
>>>> UseBiasedLocking               false
>>>> UseGCLogFileRotation           false
>>>> UseG1GC                        true
>>>> PrintJNIGCStalls               true
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> hotspot-gc-use mailing list
>>>> hotspot-gc-use at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> 