JVM taking a few seconds to reach a safepoint for routine young gen GC

Jason Goetz jason.goetz at gmail.com
Fri May 29 16:51:44 UTC 2015


Oops, I did not intend to remove this from the list. Re-added.

I'll take a look at how many RUNNABLE threads are actually blocked in native
code. I'll also look at vSphere to see if I can see anything unusual around
resource contention.

I've grepped the safepoint logs for GenCollectForAllocation, which, as I
mentioned before, happens about every second but only shows these long sync
times during the mysterious 20-minute periods. I've taken an excerpt from one
of these 20-minute periods (a rough log-filtering sketch follows the excerpt).
You can see that for most GCs the only time spent is in vmop and the sync time
is 0, but during these pauses the 'sync' time takes up the majority of the
total.

[threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop]  page_trap_count

116546.586: GenCollectForAllocation  [ 146    5    7 ]  [     0  0       0  1   50 ]  0
116546.891: GenCollectForAllocation  [ 146    2    7 ]  [     0  0       0  2   50 ]  0
116547.969: GenCollectForAllocation  [ 145    0    2 ]  [     0  0       0  2  290 ]  0
116549.500: GenCollectForAllocation  [ 145    0    0 ]  [     0  0       0  2   67 ]  0
116550.836: GenCollectForAllocation  [ 142    0    1 ]  [     0  0       0  1   82 ]  0
116553.398: GenCollectForAllocation  [ 142    0    2 ]  [     0  0       0  2   76 ]  0
116555.109: GenCollectForAllocation  [ 142    0    0 ]  [     0  0       0  2   84 ]  0
116557.328: GenCollectForAllocation  [ 142    0    0 ]  [     0  0       0  2   64 ]  0
116561.992: GenCollectForAllocation  [ 143    2    1 ]  [     0  0     523  2   76 ]  1
116567.367: GenCollectForAllocation  [ 143    1    0 ]  [     1  0      39  2  104 ]  0
116572.438: GenCollectForAllocation  [ 143    4    3 ]  [     0  0       0  2   85 ]  0
116575.977: GenCollectForAllocation  [ 144   76    1 ]  [    24  0  154039  9  181 ]  0
116731.336: GenCollectForAllocation  [ 353   41    5 ]  [     5  0       5  1  101 ]  0
116732.328: GenCollectForAllocation  [ 354    5   16 ]  [  2080  0    2115  1    0 ]  1
116736.430: GenCollectForAllocation  [ 354    5    9 ]  [     0  0       0  1   81 ]  2
116736.891: GenCollectForAllocation  [ 354    0    4 ]  [     0  0       0  4   88 ]  0
116737.305: GenCollectForAllocation  [ 354    2    9 ]  [     0  0       0  1   80 ]  0
116737.664: GenCollectForAllocation  [ 354    1    8 ]  [     0  0       0  2   65 ]  0
116738.055: GenCollectForAllocation  [ 355    1    8 ]  [     0  0       0  1  106 ]  0
116738.797: GenCollectForAllocation  [ 354    0    5 ]  [     0  0    2116  2  125 ]  0
116741.523: GenCollectForAllocation  [ 353    1    0 ]  [     5  0     502  1  195 ]  0
116743.219: GenCollectForAllocation  [ 352    1    5 ]  [     0  0       0  1    0 ]  0
116743.719: GenCollectForAllocation  [ 352    1    7 ]  [     0  0       0  1   67 ]  0
116744.266: GenCollectForAllocation  [ 352  271    0 ]  [    28  0  764563  4    0 ]  0
117509.914: GenCollectForAllocation  [ 347    1    2 ]  [     0  0       0  2  166 ]  0
117510.609: GenCollectForAllocation  [ 456   84    9 ]  [     8  0       8  2  103 ]  1
117511.305: GenCollectForAllocation  [ 479    0    6 ]  [     0  0       0  7  199 ]  0
117512.086: GenCollectForAllocation  [ 480    0    2 ]  [     0  0       0  2  192 ]  0
117829.000: GenCollectForAllocation  [ 569    0    3 ]  [     0  0       0  2    0 ]  0
117829.000: GenCollectForAllocation  [ 569    2    5 ]  [     0  0       0  0  128 ]  0
117829.523: GenCollectForAllocation  [ 569    0    6 ]  [     0  0       0  2   84 ]  0
117830.039: GenCollectForAllocation  [ 571    0    5 ]  [     0  0       0  2    0 ]  0
117830.781: GenCollectForAllocation  [ 571    0    6 ]  [     0  0       0  6   72 ]  0
117831.461: GenCollectForAllocation  [ 571    0    4 ]  [     0  0       0  1    0 ]  0
117831.469: GenCollectForAllocation  [ 571    0    3 ]  [     0  0       0  0  113 ]  0
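
For reference, a minimal sketch of that kind of filtering (untested; the class
name and the 1000 ms threshold are arbitrary, and it assumes each record sits
on one line as above) could look like this:

import java.io.BufferedReader;
import java.io.FileReader;

// Minimal sketch: print GenCollectForAllocation safepoint records whose
// 'sync' time exceeds a threshold. Assumes one record per line in the
// format shown above; class name and threshold are arbitrary.
public class LongSyncFilter {
    public static void main(String[] args) throws Exception {
        long thresholdMs = 1000;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.contains("GenCollectForAllocation")) {
                    continue;
                }
                // The second bracketed group is [spin block sync cleanup vmop].
                int open = line.indexOf('[', line.indexOf(']'));
                int close = (open < 0) ? -1 : line.indexOf(']', open);
                if (open < 0 || close < 0) {
                    continue;
                }
                String[] times = line.substring(open + 1, close).trim().split("\\s+");
                if (times.length >= 5 && Long.parseLong(times[2]) >= thresholdMs) {
                    System.out.println(line);
                }
            }
        } finally {
            in.close();
        }
    }
}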

From:  Vitaly Davidovich <vitalyd at gmail.com>
Date:  Thursday, May 28, 2015 at 4:20 PM
To:  Jason Goetz <jason.goetz at gmail.com>
Subject:  Re: JVM taking a few seconds to reach a safepoint for routine
young gen GC

Jason,

Not sure if you meant to reply just to me, but you did :)

So I suspect the RUNNABLE count you list is what jstack gives you, which is
a bit of a lie since it'll show some threads blocked in native code as
RUNNABLE.
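
A quick way to see this (a minimal sketch; the thread name and the socket
trick are just an illustration): a thread parked in the native accept() call
still reports RUNNABLE:

import java.net.ServerSocket;

// Minimal sketch: a thread blocked inside the native accept() call still
// reports RUNNABLE from Thread.getState(), which is why jstack's RUNNABLE
// count can overstate the threads actually executing Java code.
public class NativeRunnable {
    public static void main(String[] args) throws Exception {
        Thread acceptor = new Thread(new Runnable() {
            public void run() {
                try {
                    ServerSocket server = new ServerSocket(0);
                    server.accept();   // blocks in native code; no client ever connects
                } catch (Exception ignored) {
                }
            }
        }, "acceptor");
        acceptor.setDaemon(true);      // let the JVM exit when main returns
        acceptor.start();
        Thread.sleep(500);             // give the thread time to block in accept()
        System.out.println(acceptor.getState());   // prints RUNNABLE
    }
}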

The fact that you're on a VM is biasing me towards looking at that angle.
If there's a spike in runnable (from the kernel scheduler's standpoint) threads
and/or contention for resources, and it's driven by the hypervisor, I wonder if
there are any artifacts of that.  I don't have much experience running
servers on VMs (only bare metal), so it's hard to say.  You may want to reply to
the list again and see if anyone else has more insight into this type of
setup.

Also, Poonam asked for safepoint statistics for the vm ops that were
requested -- do you have those?

On Thu, May 28, 2015 at 4:20 PM, Jason Goetz <jason.goetz at gmail.com> wrote:
> I'm happy to answer whatever I can. Thanks for taking the time to help. It's
> running on a VM, not bare metal. The exact OS is Windows Server 2008. The
> database is running on another machine. There is a very large Lucene index on
> the same machine as the application and commits to this index are frequent and
> often contended.
> 
> From the thread dumps I took during these pauses (there are several that
> happen around minor GCs during these 20-minute periods) I can see the
> following stats:
> 
> Dump 1:
> Threads: 147
> RUNNABLE: 42
> WAITING: 30
> TIMED_WAITING: 75
> BLOCKED: 0
> 
> Dump 2:
>  Threads: 259
> RUNNABLE: 143
> WAITING: 47
> TIMED_WAITING: 62
> BLOCKED: 7
> 
> The only reason I can see for the thread count being higher than usual in the
> second dump is that it follows a very long pause (69 seconds, all spent in sync
> time stopping threads for a safepoint), so I think several web requests queued
> up during the pause and then needed to be served.
> 
> As far as Unsafe operations, the only thing I see in thread dumps when I grep
> for Unsafe is Unsafe.park operations in threads that are TIMED_WAITING.
> 
> As far as memory allocation, I do have some good profiling of that from the
> flight recordings that were taken, and I have a listing of allocations by thread.
> I haven't been able to see any abnormal allocations happening during the time
> of the pauses, and the total amount of memory being allocated is no different
> during these pauses. In fact, the amount of memory getting allocated (inside
> and outside TLABs) is less during these pauses, as I imagine threads waiting
> for a safepoint are taking time away from running the code that allocates
> memory.
> 
> From:  Vitaly Davidovich <vitalyd at gmail.com>
> Date:  Thursday, May 28, 2015 at 12:06 PM
> To:  Jason Goetz <jason.goetz at gmail.com>
> 
> Subject:  Re: JVM taking a few seconds to reach a safepoint for routine young
> gen GC
> 
> Thanks Jason.  Is this bare metal Windows or virtualized?  Of the 140-200
> active, how many are runnable at the time of the stalls?
> 
> Do you (or any used libs that you know of) use Unsafe for big memcpy style
> operations?
> 
> When these spikes occur, how many runnable procs are there on the machine? Is
> there scheduling contention perhaps (with Tomcat?)?
> 
> As for JNI, typically, java threads in JNI won't stall threads from sync'ing
> on a safepoint.
> 
> Sorry for the Spanish Inquisition, but it may help us figure this out or at least
> get a lead.
> 
> On Thu, May 28, 2015 at 2:45 PM, Jason Goetz <jason.goetz at gmail.com> wrote:
>> Vitaly,
>> 
>> We've seen 140-200 active threads during the time of the stalls but that's no
>> different from any other time period. There are 12 CPUs available to the JVM
>> and there is 24G in the heap, 64G on the machine.  This is the only JVM
>> running on the machine, which runs on a Windows server, and Tomcat is the
>> only application of note other than a few monitoring tools (Zabbix, HP
>> OpenView, VMware Tools), which I haven't had the option of turning off.
>> 
>> I'm not sure that JNI is running. We don't explicitly have any JNI calls
>> running, but I'm not sure whether any of the 3rd-party libraries we use
>> have JNI code that I'm unaware of. I haven't been able to figure out how to
>> identify if JNI calls are even running. We have taken several Java Flight
>> Recordings around these every-20-minute pauses, but haven't seen any patterns
>> or unusual spikes in disk I/O, thread contention, or any thread activity.
>> There is no swapping at all either.
>> 
>> Any other information that I could provide in order to give a clearer picture
>> of the system?
>> 
>> Thanks,
>> Jason
>>  
>> From:  Vitaly Davidovich <vitalyd at gmail.com>
>> Date:  Thursday, May 28, 2015 at 11:17 AM
>> To:  Jason Goetz <jason.goetz at gmail.com>
>> Cc:  hotspot-gc-use <hotspot-gc-use at openjdk.java.net>
>> Subject:  Re: JVM taking a few seconds to reach a safepoint for routine young
>> gen GC
>> 
>> 
>> Jason,
>> 
>> How many java threads are active when these stalls happen? How many CPUs are
>> available to the jvm? How much physical memory on the machine? Is your jvm
>> sole occupant of the machine or do you have noisy neighbors? You mentioned
>> JNI - do you have a lot of JNI calls around these times? Do you allocate
>> and/or write to large arrays/memory regions? Is there something
>> different/interesting about these 20 min periods (e.g. workload increases,
>> same time of day, more disk activity, any paging/swap activity, etc.)?
>> 
>> sent from my phone
>> 
>> On May 28, 2015 1:58 PM, "Jason Goetz" <jason.goetz at gmail.com> wrote:
>>> We're consistently seeing a situation where threads take a few seconds to
>>> stop for a routine GC. For 20 straight minutes the GC will run right away
>>> (it runs about every second). But then, during a 20-minute period, the
>>> threads will take longer to stop for GC. See the GC output below.
>>> 
>>> 2015-05-28T12:14:51.205-0500: 54796.811: Total time for which application
>>> threads were stopped: 0.1121233 seconds, Stopping threads took: 0.0000908
>>> seconds
>>> 2015-05-28T12:15:00.331-0500: 54805.930: Total time for which application
>>> threads were stopped: 0.0019384 seconds, Stopping threads took: 0.0001106
>>> seconds
>>> 2015-05-28T12:15:06.572-0500: 54812.174: [GC concurrent-mark-end, 28.4067370
>>> secs]
>>> 2015-05-28T12:15:09.786-0500: 54815.395: [GC remark
>>> 2015-05-28T12:15:09.786-0500: 54815.396: [GC ref-proc, 0.0103603 secs],
>>> 0.0709271 secs]
>>>  [Times: user=0.73 sys=0.00, real=0.08 secs]
>>> 2015-05-28T12:15:09.864-0500: 54815.466: Total time for which application
>>> threads were stopped: 3.2916224 seconds, Stopping threads took: 3.2188032
>>> seconds
>>> 2015-05-28T12:15:09.864-0500: 54815.467: [GC cleanup 20G->20G(30G),
>>> 0.0451098 secs]
>>>  [Times: user=0.61 sys=0.00, real=0.05 secs]
>>> 2015-05-28T12:15:09.910-0500: 54815.512: Total time for which application
>>> threads were stopped: 0.0459803 seconds, Stopping threads took: 0.0001950
>>> seconds
>>> 
>>> Turning on safepoint logging reveals that these stopping-threads times are
>>> taken up by safepoint 'sync' time. Taking thread dumps every second around
>>> these pauses fails to show anything of note happening during this time, but
>>> it's my understanding that native code won't necessarily show up in thread
>>> dumps anyway, given that those calls may have returned by the time the JVM
>>> reaches a safepoint.
>>> 
>>> Enabling PrintJNIGCStalls fails to show any logging around the 3-second
>>> pause seen above. I highly suspected JNI but was surprised that I didn't see
>>> any logging about JNI Weak References after turning that option on. Any
>>> ideas for what I can try next? We're using JDK 7u80. Here are the rest of my
>>> JVM settings:
>>> 
>>> DisableExplicitGC true
>>> FlightRecorder true
>>> GCLogFileSize 52428800
>>> ManagementServer true
>>> MinHeapSize 25769803776
>>> MaxHeapSize 25769803776
>>> MaxPermSize 536870912
>>> NumberOfGCLogFiles 10
>>> PrintAdaptiveSizePolicy true
>>> PrintGC true
>>> PrintGCApplicationStoppedTime true
>>> PrintGCCause true
>>> PrintGCDateStamps true
>>> PrintGCDetails true
>>> PrintGCTimeStamps true
>>> PrintSafepointStatistics true
>>> PrintSafepointStatisticsCount 1
>>> PrintTenuringDistribution true
>>> ReservedCodeCacheSize 268435456
>>> SafepointTimeout true
>>> SafepointTimeoutDelay 4000
>>> ThreadStackSize 4096
>>> UnlockCommercialFeatures true
>>> UseBiasedLocking false
>>> UseGCLogFileRotation false
>>> UseG1GC true
>>> PrintJNIGCStalls true
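>>> 
>>> (For reference, and assuming I'm reading the flag dump right, the
>>> command-line equivalents of these settings would be roughly as follows;
>>> the byte values above work out to a 24g heap, 512m perm gen, 256m code
>>> cache, and 4m thread stacks. The launch target and GC log path are
>>> omitted.)
>>> 
>>> -XX:+UseG1GC -Xms24g -Xmx24g -XX:MaxPermSize=512m -Xss4m
>>> -XX:+DisableExplicitGC -XX:-UseBiasedLocking -XX:ReservedCodeCacheSize=256m
>>> -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:+ManagementServer
>>> -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCCause -XX:+PrintGCDateStamps
>>> -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime
>>> -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy
>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>>> -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=4000
>>> -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=52428800 -XX:-UseGCLogFileRotation
>>> -XX:+PrintJNIGCStalls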
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>> 
> 


