JVM taking a few seconds to reach a safepoint for routine young gen GC
Gustav Åkesson
gustav.r.akesson at gmail.com
Sat May 30 17:05:08 UTC 2015
Hi,
I thought that threads blocked in native code were not a problem for
safepointing? Meaning that such threads can continue to execute (or stay
blocked) while the VM is trying to reach a safepoint, but they are not
allowed to return to Java land until the safepoint is over.
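
To make that concrete, here is a minimal standalone sketch (nothing to do
with Jason's application): one thread sits in a blocking native call while
the main thread triggers a GC safepoint (default flags assumed), and the
safepoint completes without waiting for it. A jstack taken at that moment
would still report the blocked thread as RUNNABLE.

    import java.net.ServerSocket;

    public class NativeBlockedThread {
        public static void main(String[] args) throws Exception {
            final ServerSocket server = new ServerSocket(0);
            Thread blocked = new Thread(new Runnable() {
                public void run() {
                    try {
                        server.accept(); // blocks inside native code; nobody ever connects
                    } catch (Exception ignored) {
                    }
                }
            });
            blocked.setDaemon(true);
            blocked.start();
            Thread.sleep(500); // give the thread time to reach accept()

            // The VM reaches the safepoint for this collection even though
            // 'blocked' never polls -- a thread in native is already safepoint-safe.
            System.gc();

            System.out.println("GC done; blocked thread state = " + blocked.getState());
            // Typically prints RUNNABLE, even though the thread is stuck in accept().
        }
    }
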
Best Regards,
Gustav Åkesson
On 30 May 2015 at 04:14, "Srinivas Ramakrishna" <ysr1729 at gmail.com> wrote:
> Hi Jason --
>
> You mentioned a Lucene indexer on the same box. Can you check for
> correlation between the indexing activity, paging behavior, and the
> incidence of the long safepoints?
>
> -- ramki
>
> ysr1729
>
> On May 29, 2015, at 09:51, Jason Goetz <jason.goetz at gmail.com> wrote:
>
> Oops, I did not intend to remove this from the list. Re-added.
>
> I’ll take a look at how many RUNNABLE threads are actually blocked in
> native code. I’ll also look at vSphere to see whether anything unusual shows
> up around resource contention.
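>
> One way I might do that native-code check in-process (just a sketch, not
> something I have run yet): ThreadInfo.isInNative() reports whether a thread
> is currently executing native code via JNI, so counting gives how many of
> the RUNNABLE threads are actually sitting in native calls.
>
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.ThreadInfo;
>     import java.lang.management.ThreadMXBean;
>
>     public class NativeRunnableCount {
>         public static void main(String[] args) {
>             ThreadMXBean bean = ManagementFactory.getThreadMXBean();
>             // No locked-monitor/synchronizer details needed, just state + in-native flag.
>             ThreadInfo[] infos = bean.dumpAllThreads(false, false);
>             int runnable = 0, runnableInNative = 0;
>             for (ThreadInfo info : infos) {
>                 if (info == null) continue;
>                 if (info.getThreadState() == Thread.State.RUNNABLE) {
>                     runnable++;
>                     if (info.isInNative()) {
>                         runnableInNative++;
>                     }
>                 }
>             }
>             System.out.println(runnable + " RUNNABLE threads, of which "
>                     + runnableInNative + " are in native code");
>         }
>     }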
>
> I’ve grepped the safepoint logs for GenCollectForAllocation, which, as I
> mentioned before, happens about every second, but the long sync times only
> show up during the mysterious 20-minute periods. I’ve taken an excerpt from
> one of these 20-minute stretches. You can see that for most GCs the only
> time spent is in vmop and the sync time is 0, but during these pauses the
> ’sync’ time takes up the majority of the total.
>
> [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>
> 116546.586: GenCollectForAllocation  [ 146   5   7 ]  [    0  0      0  1   50 ]  0
> 116546.891: GenCollectForAllocation  [ 146   2   7 ]  [    0  0      0  2   50 ]  0
> 116547.969: GenCollectForAllocation  [ 145   0   2 ]  [    0  0      0  2  290 ]  0
> 116549.500: GenCollectForAllocation  [ 145   0   0 ]  [    0  0      0  2   67 ]  0
> 116550.836: GenCollectForAllocation  [ 142   0   1 ]  [    0  0      0  1   82 ]  0
> 116553.398: GenCollectForAllocation  [ 142   0   2 ]  [    0  0      0  2   76 ]  0
> 116555.109: GenCollectForAllocation  [ 142   0   0 ]  [    0  0      0  2   84 ]  0
> 116557.328: GenCollectForAllocation  [ 142   0   0 ]  [    0  0      0  2   64 ]  0
> 116561.992: GenCollectForAllocation  [ 143   2   1 ]  [    0  0    523  2   76 ]  1
> 116567.367: GenCollectForAllocation  [ 143   1   0 ]  [    1  0     39  2  104 ]  0
> 116572.438: GenCollectForAllocation  [ 143   4   3 ]  [    0  0      0  2   85 ]  0
> 116575.977: GenCollectForAllocation  [ 144  76   1 ]  [   24  0 154039  9  181 ]  0
> 116731.336: GenCollectForAllocation  [ 353  41   5 ]  [    5  0      5  1  101 ]  0
> 116732.328: GenCollectForAllocation  [ 354   5  16 ]  [ 2080  0   2115  1    0 ]  1
> 116736.430: GenCollectForAllocation  [ 354   5   9 ]  [    0  0      0  1   81 ]  2
> 116736.891: GenCollectForAllocation  [ 354   0   4 ]  [    0  0      0  4   88 ]  0
> 116737.305: GenCollectForAllocation  [ 354   2   9 ]  [    0  0      0  1   80 ]  0
> 116737.664: GenCollectForAllocation  [ 354   1   8 ]  [    0  0      0  2   65 ]  0
> 116738.055: GenCollectForAllocation  [ 355   1   8 ]  [    0  0      0  1  106 ]  0
> 116738.797: GenCollectForAllocation  [ 354   0   5 ]  [    0  0   2116  2  125 ]  0
> 116741.523: GenCollectForAllocation  [ 353   1   0 ]  [    5  0    502  1  195 ]  0
> 116743.219: GenCollectForAllocation  [ 352   1   5 ]  [    0  0      0  1    0 ]  0
> 116743.719: GenCollectForAllocation  [ 352   1   7 ]  [    0  0      0  1   67 ]  0
> 116744.266: GenCollectForAllocation  [ 352 271   0 ]  [   28  0 764563  4    0 ]  0
> 117509.914: GenCollectForAllocation  [ 347   1   2 ]  [    0  0      0  2  166 ]  0
> 117510.609: GenCollectForAllocation  [ 456  84   9 ]  [    8  0      8  2  103 ]  1
> 117511.305: GenCollectForAllocation  [ 479   0   6 ]  [    0  0      0  7  199 ]  0
> 117512.086: GenCollectForAllocation  [ 480   0   2 ]  [    0  0      0  2  192 ]  0
> 117829.000: GenCollectForAllocation  [ 569   0   3 ]  [    0  0      0  2    0 ]  0
> 117829.000: GenCollectForAllocation  [ 569   2   5 ]  [    0  0      0  0  128 ]  0
> 117829.523: GenCollectForAllocation  [ 569   0   6 ]  [    0  0      0  2   84 ]  0
> 117830.039: GenCollectForAllocation  [ 571   0   5 ]  [    0  0      0  2    0 ]  0
> 117830.781: GenCollectForAllocation  [ 571   0   6 ]  [    0  0      0  6   72 ]  0
> 117831.461: GenCollectForAllocation  [ 571   0   4 ]  [    0  0      0  1    0 ]  0
> 117831.469: GenCollectForAllocation  [ 571   0   3 ]  [    0  0      0  0  113 ]  0
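>
> (I’ve only been grepping and eyeballing these by hand, but in case it’s
> useful, here is a rough sketch of a parser for such lines -- it assumes the
> column layout shown in the header above and that the times are in
> milliseconds:)
>
>     import java.io.BufferedReader;
>     import java.io.FileReader;
>     import java.util.regex.Matcher;
>     import java.util.regex.Pattern;
>
>     public class LongSyncGrep {
>         // <timestamp>: <vmop> [ total initially_running wait_to_block ] [ spin block sync cleanup vmop ] page_trap_count
>         private static final Pattern SAFEPOINT_LINE = Pattern.compile(
>                 "(\\d+\\.\\d+):\\s+(\\S+)\\s+" +
>                 "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]\\s+" +
>                 "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]\\s+(\\d+)");
>
>         public static void main(String[] args) throws Exception {
>             BufferedReader in = new BufferedReader(new FileReader(args[0]));
>             try {
>                 String line;
>                 while ((line = in.readLine()) != null) {
>                     Matcher m = SAFEPOINT_LINE.matcher(line);
>                     if (!m.find()) continue;
>                     long syncMillis = Long.parseLong(m.group(8));
>                     if (syncMillis >= 100) { // flag anything with >= 100 ms time-to-safepoint
>                         System.out.println(m.group(1) + "s " + m.group(2)
>                                 + " sync=" + syncMillis + "ms"
>                                 + " initially_running=" + m.group(4));
>                     }
>                 }
>             } finally {
>                 in.close();
>             }
>         }
>     }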
>
> From: Vitaly Davidovich <vitalyd at gmail.com>
> Date: Thursday, May 28, 2015 at 4:20 PM
> To: Jason Goetz <jason.goetz at gmail.com>
> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
> young gen GC
>
> Jason,
>
> Not sure if you meant to reply just to me, but you did :)
>
> So I suspect the RUNNABLE count you list is what jstack gives you, which is
> slightly misleading, since it will report some threads that are blocked in
> native code as RUNNABLE.
>
> The fact that you're on a VM is biasing me toward looking at that angle.
> If there's a spike in runnable threads (from the kernel scheduler's
> standpoint) and/or contention for resources, and it's driven by the
> hypervisor, I wonder if there are any artifacts from that. I don't have
> much experience running servers on VMs (only bare metal), so it's hard to
> say. You may want to reply to the list again and see if anyone else has
> more insight into this type of setup.
>
> Also, Poonam asked for safepoint statistics for the vm ops that were
> requested -- do you have those?
>
> On Thu, May 28, 2015 at 4:20 PM, Jason Goetz <jason.goetz at gmail.com>
> wrote:
>
>> I’m happy to answer whatever I can. Thanks for taking the time to help.
>> It’s running on a VM, not bare metal. The exact OS is Windows Server 2008.
>> The database is running on another machine. There is a very large Lucene
>> index on the same machine as the application and commits to this index are
>> frequent and often contended.
>>
>> From the thread dumps I took during these pauses (there are several that
>> happen around minor GCs during these 20-minute periods) I can see the
>> following stats:
>>
>> Dump 1:
>> Threads: 147
>> RUNNABLE: 42
>> WAITING: 30
>> TIMED_WAITING: 75
>> BLOCKED: 0
>>
>> Dump 2:
>> Threads: 259
>> RUNNABLE: 143
>> WAITING: 47
>> TIMED_WAITING: 62
>> BLOCKED: 7
>>
>> The only reason I believe the thread count is higher than usual in the
>> second dump is that it follows a very long pause (69 seconds, all spent in
>> sync time stopping threads for a safepoint), so I think several web
>> requests piled up during the pause and then needed to be served.
>>
>> As for Unsafe operations, the only thing I see when I grep the thread
>> dumps for Unsafe is Unsafe.park calls in threads that are TIMED_WAITING.
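>>
>> (If I read those right, they are just LockSupport-based timed waiting --
>> thread pools and the like park their idle workers that way, so by
>> themselves they look benign. A trivial sketch that reproduces exactly that
>> stack shape:)
>>
>>     import java.util.concurrent.TimeUnit;
>>     import java.util.concurrent.locks.LockSupport;
>>
>>     public class ParkedThread {
>>         public static void main(String[] args) {
>>             // A jstack of this thread while it is parked shows:
>>             //   java.lang.Thread.State: TIMED_WAITING (parking)
>>             //     at sun.misc.Unsafe.park(Native Method)
>>             //     at java.util.concurrent.locks.LockSupport.parkNanos(...)
>>             LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(30));
>>         }
>>     }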
>>
>> As for memory allocation, I do have good profiling of that from the flight
>> recordings, including a breakdown of allocations by thread. I haven't seen
>> any abnormal allocations during the pauses, and the total amount of memory
>> being allocated is no different then. In fact, the amount of memory
>> allocated (inside and outside TLABs) is lower during these pauses, as I
>> imagine the time threads spend waiting for a safepoint takes time away
>> from running the code that allocates memory.
>>
>> From: Vitaly Davidovich <vitalyd at gmail.com>
>> Date: Thursday, May 28, 2015 at 12:06 PM
>> To: Jason Goetz <jason.goetz at gmail.com>
>>
>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>> young gen GC
>>
>> Thanks Jason. Is this bare metal Windows or virtualized? Of the 140-200
>> active, how many are runnable at the time of the stalls?
>>
>> Do you (or any libraries you use, as far as you know) use Unsafe for big
>> memcpy-style operations?
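>>
>> (To make that concrete -- a sketch with a made-up size, not anything I know
>> you're running: a single large Unsafe.copyMemory on this JDK has no
>> safepoint polls inside it, so the copying thread can't stop for a safepoint
>> until the whole copy returns, which would show up as long 'sync' time for
>> everyone else.)
>>
>>     import java.lang.reflect.Field;
>>     import sun.misc.Unsafe;
>>
>>     public class BigNativeCopy {
>>         public static void main(String[] args) throws Exception {
>>             // Grab the Unsafe instance reflectively (standard trick for demos).
>>             Field f = Unsafe.class.getDeclaredField("theUnsafe");
>>             f.setAccessible(true);
>>             Unsafe unsafe = (Unsafe) f.get(null);
>>
>>             long bytes = 1L << 30; // 1 GiB, purely illustrative
>>             long src = unsafe.allocateMemory(bytes);
>>             long dst = unsafe.allocateMemory(bytes);
>>
>>             // No safepoint checks happen inside this call; the thread only
>>             // becomes safepoint-safe again once the copy finishes.
>>             unsafe.copyMemory(src, dst, bytes);
>>
>>             unsafe.freeMemory(src);
>>             unsafe.freeMemory(dst);
>>         }
>>     }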
>>
>> When these spikes occur, how many runnable procs are there on the
>> machine? Is there scheduling contention perhaps (with Tomcat?)?
>>
>> As for JNI: typically, Java threads executing JNI code won't hold up the
>> other threads from syncing on a safepoint.
>>
>> Sorry for the Spanish Inquisition, but it may help us figure this out, or
>> at least get a lead.
>>
>> On Thu, May 28, 2015 at 2:45 PM, Jason Goetz <jason.goetz at gmail.com>
>> wrote:
>>
>>> Vitaly,
>>>
>>> We’ve seen 140-200 active threads during the time of the stalls, but
>>> that’s no different from any other time period. There are 12 CPUs
>>> available to the JVM, 24G in the heap, and 64G on the machine. This is
>>> the only JVM running on the machine, which is a Windows server, and
>>> Tomcat is the only application of note other than a few monitoring tools
>>> (Zabbix, HP OpenView, VMware Tools), which I haven’t had the option of
>>> turning off.
>>>
>>> I’m not sure whether any JNI is running. We don’t explicitly make any JNI
>>> calls, but I can’t rule out that some of the third-party libraries we use
>>> contain JNI code I’m unaware of, and I haven’t been able to figure out how
>>> to tell whether JNI calls are even being made. We have taken several Java
>>> Flight Recordings around these every-20-minute pauses, but haven’t seen
>>> any patterns or unusual spikes in disk I/O, thread contention, or any
>>> other thread activity. There is no swapping at all either.
>>>
>>> Any other information that I could provide in order to give a clearer
>>> picture of the system?
>>>
>>> Thanks,
>>> Jason
>>>
>>> From: Vitaly Davidovich <vitalyd at gmail.com>
>>> Date: Thursday, May 28, 2015 at 11:17 AM
>>> To: Jason Goetz <jason.goetz at gmail.com>
>>> Cc: hotspot-gc-use <hotspot-gc-use at openjdk.java.net>
>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>> young gen GC
>>>
>>> Jason,
>>>
>>> How many Java threads are active when these stalls happen? How many CPUs
>>> are available to the JVM? How much physical memory is on the machine? Is
>>> your JVM the sole occupant of the machine, or do you have noisy
>>> neighbors? You mentioned JNI - do you have a lot of JNI calls around
>>> these times? Do you allocate and/or write to large arrays/memory regions?
>>> Is there something different/interesting about these 20-minute periods
>>> (e.g. workload increases, same time of day, more disk activity, any
>>> paging/swap activity, etc.)?
>>>
>>> sent from my phone
>>> On May 28, 2015 1:58 PM, "Jason Goetz" <jason.goetz at gmail.com> wrote:
>>>
>>>> We're consistently seeing a situation where threads take a few seconds
>>>> to stop for a routine GC. For 20 straight minutes the GC will run right
>>>> away (it runs about every second). But then, during a 20-minute period, the
>>>> threads will take longer to stop for GC. See the GC output below.
>>>>
>>>> 2015-05-28T12:14:51.205-0500: 54796.811: Total time for which application threads were stopped: 0.1121233 seconds, Stopping threads took: 0.0000908 seconds
>>>> 2015-05-28T12:15:00.331-0500: 54805.930: Total time for which application threads were stopped: 0.0019384 seconds, Stopping threads took: 0.0001106 seconds
>>>> 2015-05-28T12:15:06.572-0500: 54812.174: [GC concurrent-mark-end, 28.4067370 secs]
>>>> 2015-05-28T12:15:09.786-0500: 54815.395: [GC remark 2015-05-28T12:15:09.786-0500: 54815.396: [GC ref-proc, 0.0103603 secs], 0.0709271 secs]
>>>>  [Times: user=0.73 sys=0.00, real=0.08 secs]
>>>> *2015-05-28T12:15:09.864-0500: 54815.466: Total time for which application threads were stopped: 3.2916224 seconds, Stopping threads took: 3.2188032 seconds*
>>>> 2015-05-28T12:15:09.864-0500: 54815.467: [GC cleanup 20G->20G(30G), 0.0451098 secs]
>>>>  [Times: user=0.61 sys=0.00, real=0.05 secs]
>>>> 2015-05-28T12:15:09.910-0500: 54815.512: Total time for which application threads were stopped: 0.0459803 seconds, Stopping threads took: 0.0001950 seconds
>>>>
>>>> Turning on safepoint logging reveals that these "stopping threads" times
>>>> are taken up by safepoint ‘sync’ time. Taking thread dumps every second
>>>> around these pauses fails to show anything of note happening during this
>>>> time, but it’s my understanding that native code won’t necessarily show
>>>> up in thread dumps anyway, since those calls may have returned by the
>>>> time the JVM reaches the safepoint at which the dump is taken.
>>>>
>>>> Enabling PrintJNIGCStalls fails to produce any logging around the
>>>> 3-second pause seen above. I strongly suspected JNI, but was surprised
>>>> that I didn’t see any logging about JNI weak references after turning
>>>> that option on. Any ideas for what I can try next? We’re using JDK 7u80.
>>>> Here are the rest of my JVM settings:
>>>>
>>>> DisableExplicitGC true
>>>> FlightRecorder true
>>>> GCLogFileSize 52428800
>>>> ManagementServer true
>>>> MinHeapSize 25769803776
>>>> MaxHeapSize 25769803776
>>>> MaxPermSize 536870912
>>>> NumberOfGCLogFiles 10
>>>> PrintAdaptiveSizePolicy true
>>>> PrintGC true
>>>> PrintGCApplicationStoppedTime true
>>>> PrintGCCause true
>>>> PrintGCDateStamps true
>>>> PrintGCDetails true
>>>> PrintGCTimeStamps true
>>>> PrintSafepointStatistics true
>>>> PrintSafepointStatisticsCount 1
>>>> PrintTenuringDistribution true
>>>> ReservedCodeCacheSize 268435456
>>>> SafepointTimeout true
>>>> SafepointTimeoutDelay 4000
>>>> ThreadStackSize 4096
>>>> UnlockCommercialFeatures true
>>>> UseBiasedLocking false
>>>> UseGCLogFileRotation false
>>>> UseG1GC true
>>>> PrintJNIGCStalls true
>>>>
>>>>
>>>>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use