JVM taking a few seconds to reach a safepoint for routine young gen GC

Bernd ecki at zusammenkunft.net
Mon Jun 1 11:37:42 UTC 2015


Yes, they can continue, but they also have to reach the safepoint. If not
all threads have reached the safepoint, the operation (GC) that requested
the safepoint cannot proceed and start its work - as it is not safe. So
basically the slowest thread determines the time it takes to reach the
safepoint.
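
For illustration, a minimal self-contained Java sketch of that effect (class
and method names are made up, not from this thread). On HotSpot, a
JIT-compiled counted int loop may carry no safepoint poll, so one thread
stuck in such a loop can stretch the sync / "Stopping threads took" time for
every safepoint request:

    // Sketch only: one thread in a long counted int loop. HotSpot's JIT may
    // emit no safepoint poll inside such a loop, so the VM operation below
    // has to wait until the loop finishes - the slowest thread sets the
    // time-to-safepoint for everyone.
    public class SlowToSafepoint {

        static long spin() {
            long sum = 0;
            for (int i = 0; i < Integer.MAX_VALUE; i++) {  // counted int loop
                sum += i;
            }
            return sum;
        }

        public static void main(String[] args) throws Exception {
            Thread slow = new Thread(new Runnable() {
                public void run() { System.out.println(spin()); }
            });
            slow.start();
            Thread.sleep(100);   // let the loop get compiled and running

            // System.gc() requests a VM operation that needs a safepoint;
            // its sync time is dominated by the 'slow' thread.
            long t0 = System.nanoTime();
            System.gc();
            System.out.println("GC request took "
                    + (System.nanoTime() - t0) / 1000000 + " ms");
            slow.join();
        }
    }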

Bernd
On 30.05.2015 at 19:35, "Gustav Åkesson" <gustav.r.akesson at gmail.com> wrote:

> Hi,
>
> I thought that threads (blocked) in native were not a problem for
> safepointing? Meaning that these threads can continue to execute/block
> during an attempt to reach a safepoint, but are denied from returning back
> to Java-land until the safepoint is over.
>
> Best Regards,
> Gustav Åkesson
> On 30 May 2015 at 04:14, "Srinivas Ramakrishna" <ysr1729 at gmail.com> wrote:
>
>> Hi Jason --
>>
>> You mentioned a lucene indexer on the same box. Can you check for
>> correlation between the indexing activity, paging behavior and the
>> incidence of the long safe points?
>>
>> -- ramki
>>
>> ysr1729
>>
>> On May 29, 2015, at 09:51, Jason Goetz <jason.goetz at gmail.com> wrote:
>>
>> Oops, I did not intend to remove this from the list. Re-added.
>>
>> I’ll take a look at how many RUNNABLE threads are actually blocked in
>> native code. I’ll also look at VSphere to see if I can see anything unusual
>> around resource contention.
>>
>> I’ve grepped the safepoint logs for GenCollectForAllocation entries, which,
>> as I mentioned before, happen about every second, but they only show these
>> long sync times during the mysterious 20-minute periods. I’ve taken an
>> excerpt from one of these 20-minute periods. You can see that for most GCs
>> the only time is in vmop and the sync time is 0, but during these pauses the
>> ’sync’ time takes up the majority of the total (a rough filtering sketch
>> follows the excerpt).
>>
>> [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>>
>> 116546.586: GenCollectForAllocation  [ 146   5   7 ]  [    0   0      0   1   50 ]  0
>> 116546.891: GenCollectForAllocation  [ 146   2   7 ]  [    0   0      0   2   50 ]  0
>> 116547.969: GenCollectForAllocation  [ 145   0   2 ]  [    0   0      0   2  290 ]  0
>> 116549.500: GenCollectForAllocation  [ 145   0   0 ]  [    0   0      0   2   67 ]  0
>> 116550.836: GenCollectForAllocation  [ 142   0   1 ]  [    0   0      0   1   82 ]  0
>> 116553.398: GenCollectForAllocation  [ 142   0   2 ]  [    0   0      0   2   76 ]  0
>> 116555.109: GenCollectForAllocation  [ 142   0   0 ]  [    0   0      0   2   84 ]  0
>> 116557.328: GenCollectForAllocation  [ 142   0   0 ]  [    0   0      0   2   64 ]  0
>> 116561.992: GenCollectForAllocation  [ 143   2   1 ]  [    0   0    523   2   76 ]  1
>> 116567.367: GenCollectForAllocation  [ 143   1   0 ]  [    1   0     39   2  104 ]  0
>> 116572.438: GenCollectForAllocation  [ 143   4   3 ]  [    0   0      0   2   85 ]  0
>> 116575.977: GenCollectForAllocation  [ 144  76   1 ]  [   24   0 154039   9  181 ]  0
>> 116731.336: GenCollectForAllocation  [ 353  41   5 ]  [    5   0      5   1  101 ]  0
>> 116732.328: GenCollectForAllocation  [ 354   5  16 ]  [ 2080   0   2115   1    0 ]  1
>> 116736.430: GenCollectForAllocation  [ 354   5   9 ]  [    0   0      0   1   81 ]  2
>> 116736.891: GenCollectForAllocation  [ 354   0   4 ]  [    0   0      0   4   88 ]  0
>> 116737.305: GenCollectForAllocation  [ 354   2   9 ]  [    0   0      0   1   80 ]  0
>> 116737.664: GenCollectForAllocation  [ 354   1   8 ]  [    0   0      0   2   65 ]  0
>> 116738.055: GenCollectForAllocation  [ 355   1   8 ]  [    0   0      0   1  106 ]  0
>> 116738.797: GenCollectForAllocation  [ 354   0   5 ]  [    0   0   2116   2  125 ]  0
>> 116741.523: GenCollectForAllocation  [ 353   1   0 ]  [    5   0    502   1  195 ]  0
>> 116743.219: GenCollectForAllocation  [ 352   1   5 ]  [    0   0      0   1    0 ]  0
>> 116743.719: GenCollectForAllocation  [ 352   1   7 ]  [    0   0      0   1   67 ]  0
>> 116744.266: GenCollectForAllocation  [ 352 271   0 ]  [   28   0 764563   4    0 ]  0
>> 117509.914: GenCollectForAllocation  [ 347   1   2 ]  [    0   0      0   2  166 ]  0
>> 117510.609: GenCollectForAllocation  [ 456  84   9 ]  [    8   0      8   2  103 ]  1
>> 117511.305: GenCollectForAllocation  [ 479   0   6 ]  [    0   0      0   7  199 ]  0
>> 117512.086: GenCollectForAllocation  [ 480   0   2 ]  [    0   0      0   2  192 ]  0
>> 117829.000: GenCollectForAllocation  [ 569   0   3 ]  [    0   0      0   2    0 ]  0
>> 117829.000: GenCollectForAllocation  [ 569   2   5 ]  [    0   0      0   0  128 ]  0
>> 117829.523: GenCollectForAllocation  [ 569   0   6 ]  [    0   0      0   2   84 ]  0
>> 117830.039: GenCollectForAllocation  [ 571   0   5 ]  [    0   0      0   2    0 ]  0
>> 117830.781: GenCollectForAllocation  [ 571   0   6 ]  [    0   0      0   6   72 ]  0
>> 117831.461: GenCollectForAllocation  [ 571   0   4 ]  [    0   0      0   1    0 ]  0
>> 117831.469: GenCollectForAllocation  [ 571   0   3 ]  [    0   0      0   0  113 ]  0
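>>
>> As a rough illustration of the filtering described above - the regex, file
>> handling and threshold are assumptions, not my actual tooling - something
>> like this flags the GenCollectForAllocation entries whose sync column is
>> large:
>>
>> import java.io.BufferedReader;
>> import java.io.FileReader;
>> import java.util.regex.Matcher;
>> import java.util.regex.Pattern;
>>
>> // Sketch only: scan -XX:+PrintSafepointStatistics output and report
>> // GenCollectForAllocation lines whose 'sync' time exceeds a threshold.
>> public class LongSyncFilter {
>>     // <timestamp>: <vmop> [ threads... ] [ spin block sync cleanup vmop ] traps
>>     private static final Pattern LINE = Pattern.compile(
>>         "^\\s*([\\d.]+):\\s+(\\S+)\\s+\\[[^\\]]*\\]\\s+\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]");
>>
>>     public static void main(String[] args) throws Exception {
>>         long thresholdMs = 100;   // arbitrary example threshold
>>         BufferedReader in = new BufferedReader(new FileReader(args[0]));
>>         String line;
>>         while ((line = in.readLine()) != null) {
>>             Matcher m = LINE.matcher(line);
>>             if (m.find() && "GenCollectForAllocation".equals(m.group(2))) {
>>                 // groups 3..7 = spin block sync cleanup vmop
>>                 long sync = Long.parseLong(m.group(5));
>>                 if (sync > thresholdMs) {
>>                     System.out.println("long sync " + sync + " ms at " + m.group(1));
>>                 }
>>             }
>>         }
>>         in.close();
>>     }
>> }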
>>
>> From: Vitaly Davidovich <vitalyd at gmail.com>
>> Date: Thursday, May 28, 2015 at 4:20 PM
>> To: Jason Goetz <jason.goetz at gmail.com>
>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>> young gen GC
>>
>> Jason,
>>
>> Not sure if you meant to reply just to me, but you did :)
>>
>> So I suspect the RUNNABLE count you list is what jstack gives you, which is
>> slightly misleading, since it will show some threads blocked in native code
>> as RUNNABLE.
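>>
>> A small self-contained sketch of what I mean (names are made up, not from
>> your app): a thread sitting in a blocking native call such as
>> ServerSocket.accept() is reported as RUNNABLE by Thread.getState() and by
>> jstack, even though it is parked in the kernel.
>>
>> import java.net.ServerSocket;
>>
>> // Illustrative only: a thread blocked in native accept() still reports RUNNABLE.
>> public class NativeRunnable {
>>     public static void main(String[] args) throws Exception {
>>         final ServerSocket server = new ServerSocket(0);   // nobody will connect
>>         Thread t = new Thread(new Runnable() {
>>             public void run() {
>>                 try {
>>                     server.accept();      // blocks inside native code
>>                 } catch (Exception ignored) {
>>                     // closing the socket below unblocks accept() with an exception
>>                 }
>>             }
>>         }, "blocked-in-native");
>>         t.start();
>>         Thread.sleep(200);                // give it time to enter accept()
>>         // Prints RUNNABLE although the thread is stuck in a blocking syscall.
>>         System.out.println(t.getName() + ": " + t.getState());
>>         server.close();
>>         t.join();
>>     }
>> }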
>>
>> The fact that you're on a VM is biasing me towards looking at that
>> angle.  If there's a spike in runnable (from the kernel scheduler's
>> standpoint) threads and/or contention for resources, and it's driven by the
>> hypervisor, I wonder if there are any artifacts of that.  I don't have much
>> experience running servers on VMs (only bare metal), so it's hard to say.
>> You may want to reply to the list again and see if anyone else has more
>> insight into this type of setup.
>>
>> Also, Poonam asked for safepoint statistics for the vm ops that were
>> requested -- do you have those?
>>
>> On Thu, May 28, 2015 at 4:20 PM, Jason Goetz <jason.goetz at gmail.com>
>> wrote:
>>
>>> I’m happy to answer whatever I can. Thanks for taking the time to help.
>>> It’s running on a VM, not bare metal. The exact OS is Windows Server 2008.
>>> The database is running on another machine. There is a very large Lucene
>>> index on the same machine as the application and commits to this index are
>>> frequent and often contended.
>>>
>>> From the thread dumps I took during these pauses (there are several that
>>> happen around minor GCs during these 20-minute periods) I can see the
>>> following stats:
>>>
>>> Dump 1:
>>> Threads: 147
>>> RUNNABLE: 42
>>> WAITING: 30
>>> TIMED_WAITING: 75
>>> BLOCKED: 0
>>>
>>> Dump 2:
>>> Threads: 259
>>> RUNNABLE: 143
>>> WAITING: 47
>>> TIMED_WAITING: 62
>>> BLOCKED: 7
>>>
>>> The only reason I believe the thread count is higher than usual on the
>>> second dump is that the dump follows a very long pause (69 seconds, all
>>> spent in sync time stopping threads for a safepoint), so I think several
>>> web requests queued up during this pause and then needed to be served.
>>>
>>> As far as Unsafe operations, the only thing I see in thread dumps when I
>>> grep for Unsafe is Unsafe.park operations in threads that are
>>> TIMED_WAITING.
>>>
>>> As far as memory allocation, I do have some good profiling of that from
>>> the flight recordings that are taken and have a listing of allocations by
>>> thread. I haven’t been able to see any abnormal allocations happening
>>> during the time of the pauses, and the total amount of memory being
>>> allocated is no different during these pauses. In fact, the amount of
>>> memory getting allocated (inside and outside TLABs) is lower during these
>>> pauses, as I imagine the time that threads spend waiting for a safepoint
>>> takes time away from running code that allocates memory.
>>>
>>> From: Vitaly Davidovich <vitalyd at gmail.com>
>>> Date: Thursday, May 28, 2015 at 12:06 PM
>>> To: Jason Goetz <jason.goetz at gmail.com>
>>>
>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>> young gen GC
>>>
>>> Thanks Jason.  Is this bare metal Windows or virtualized?  Of the
>>> 140-200 active, how many are runnable at the time of the stalls?
>>>
>>> Do you (or any libraries you use, as far as you know) use Unsafe for big
>>> memcpy-style operations?
>>>
>>> When these spikes occur, how many runnable procs are there on the
>>> machine? Is there scheduling contention perhaps (with Tomcat?)?
>>>
>>> As for JNI: typically, Java threads executing JNI code won't stall the
>>> other threads from sync'ing on a safepoint.
>>>
>>> Sorry for the Spanish Inquisition, but it may help us figure this out or
>>> at least get a lead.
>>>
>>> On Thu, May 28, 2015 at 2:45 PM, Jason Goetz <jason.goetz at gmail.com>
>>> wrote:
>>>
>>>> Vitaly,
>>>>
>>>> We’ve seen 140-200 active threads during the time of the stalls, but
>>>> that’s no different from any other time period. There are 12 CPUs available
>>>> to the JVM, and there is 24G in the heap and 64G on the machine.  This is
>>>> the only JVM running on the machine, which runs on a Windows server, and
>>>> Tomcat is the only application of note other than a few monitoring tools
>>>> (Zabbix, HP Open View, VMWare Tools), which I haven’t had the option of
>>>> turning off.
>>>>
>>>> I’m not sure that JNI is in use. We don’t explicitly make any JNI
>>>> calls, but I’m not sure whether any of the 3rd-party
>>>> libraries we use have JNI code that I’m unaware of. I haven’t been able to
>>>> figure out how to identify whether JNI calls are even being made. We have taken
>>>> several Java Flight Recordings around these every-20-minute pauses, but
>>>> haven’t seen any patterns or unusual spikes in disk I/O, thread contention,
>>>> or any thread activity. There is no swapping at all either.
>>>>
>>>> Any other information that I could provide in order to give a clearer
>>>> picture of the system?
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>> From: Vitaly Davidovich <vitalyd at gmail.com>
>>>> Date: Thursday, May 28, 2015 at 11:17 AM
>>>> To: Jason Goetz <jason.goetz at gmail.com>
>>>> Cc: hotspot-gc-use <hotspot-gc-use at openjdk.java.net>
>>>> Subject: Re: JVM taking a few seconds to reach a safepoint for routine
>>>> young gen GC
>>>>
>>>> Jason,
>>>>
>>>> How many java threads are active when these stalls happen? How many
>>>> CPUs are available to the jvm? How much physical memory on the machine? Is
>>>> your jvm sole occupant of the machine or do you have noisy neighbors? You
>>>> mentioned JNI - do you have a lot of JNI calls around these times? Do you
>>>> allocate and/or write to large arrays/memory regions? Is there anything
>>>> different/interesting about these 20-min periods (e.g. workload increases,
>>>> same time of day, more disk activity, any paging/swap activity, etc.)?
>>>>
>>>> sent from my phone
>>>> On May 28, 2015 1:58 PM, "Jason Goetz" <jason.goetz at gmail.com> wrote:
>>>>
>>>>> We're consistently seeing a situation where threads take a few seconds
>>>>> to stop for a routine GC. For 20 straight minutes the GC will run right
>>>>> away (it runs about every second). But then, during a 20-minute period, the
>>>>> threads will take longer to stop for GC. See the GC output below.
>>>>>
>>>>> 2015-05-28T12:14:51.205-0500: 54796.811: Total time for which
>>>>> application threads were stopped: 0.1121233 seconds, Stopping threads took:
>>>>> 0.0000908 seconds
>>>>> 2015-05-28T12:15:00.331-0500: 54805.930: Total time for which
>>>>> application threads were stopped: 0.0019384 seconds, Stopping threads took:
>>>>> 0.0001106 seconds
>>>>> 2015-05-28T12:15:06.572-0500: 54812.174: [GC concurrent-mark-end,
>>>>> 28.4067370 secs]
>>>>> 2015-05-28T12:15:09.786-0500: 54815.395: [GC remark
>>>>> 2015-05-28T12:15:09.786-0500: 54815.396: [GC ref-proc, 0.0103603 secs],
>>>>> 0.0709271 secs]
>>>>>  [Times: user=0.73 sys=0.00, real=0.08 secs]
>>>>> *2015-05-28T12:15:09.864-0500: 54815.466: Total time for which
>>>>> application threads were stopped: 3.2916224 seconds, Stopping threads took:
>>>>> 3.2188032 seconds*
>>>>> 2015-05-28T12:15:09.864-0500: 54815.467: [GC cleanup 20G->20G(30G),
>>>>> 0.0451098 secs]
>>>>>  [Times: user=0.61 sys=0.00, real=0.05 secs]
>>>>> 2015-05-28T12:15:09.910-0500: 54815.512: Total time for which
>>>>> application threads were stopped: 0.0459803 seconds, Stopping threads took:
>>>>> 0.0001950 seconds
>>>>>
>>>>> Turning on safepoint logging reveals that these stopping-threads times
>>>>> are taken up by safepoint ‘sync’ time. Taking thread dumps every second
>>>>> around these pauses fails to show anything of note happening during this
>>>>> time, but it’s my understanding that native code won’t necessarily show up
>>>>> in thread dumps anyway, given that those calls may have exited before the
>>>>> JVM reaches a safepoint.
>>>>>
>>>>> Enabling PrintJNIGCStalls fails to show any logging around the 3-second
>>>>> pause seen above. I highly suspected JNI but was surprised that I
>>>>> didn’t see any logging about JNI Weak References after turning that option
>>>>> on. Any ideas for what I can try next? We’re using JDK 7u80. Here are the
>>>>> rest of my JVM settings:
>>>>>
>>>>> DisableExplicitGC true
>>>>> FlightRecorder true
>>>>> GCLogFileSize 52428800
>>>>> ManagementServer true
>>>>> MinHeapSize 25769803776
>>>>> MaxHeapSize 25769803776
>>>>> MaxPermSize 536870912
>>>>> NumberOfGCLogFiles 10
>>>>> PrintAdaptiveSizePolicy true
>>>>> PrintGC true
>>>>> PrintGCApplicationStoppedTime true
>>>>> PrintGCCause true
>>>>> PrintGCDateStamps true
>>>>> PrintGCDetails true
>>>>> PrintGCTimeStamps true
>>>>> PrintSafepointStatistics true
>>>>> PrintSafepointStatisticsCount 1
>>>>> PrintTenuringDistribution true
>>>>> ReservedCodeCacheSize 268435456
>>>>> SafepointTimeout true
>>>>> SafepointTimeoutDelay 4000
>>>>> ThreadStackSize 4096
>>>>> UnlockCommercialFeatures true
>>>>> UseBiasedLocking false
>>>>> UseGCLogFileRotation false
>>>>> UseG1GC true
>>>>> PrintJNIGCStalls true
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>
>>>>>
>>>
>

