unexplained CMS pauses

Mon Sep 28 16:39:38 PDT 2009

Ouch.  Missed that.

Paul

Y.S.Ramakrishna at Sun.COM wrote:
> Hi Paul, that would be 100 us + minor gc time (=30 ms) = 30.1 ms.
> It does not explain the 10- to 40-fold increase in txn times observed
> here (of 300 ms - 1200 ms).
>
> -- ramki
>
> On 09/28/09 16:24, Paul Hohensee wrote:
>> Might have nothing to do with concurrent mark running, rather with 
>> minor collections
>> happening during a java->native call.
>>
>> Minor collections are stop-the-world, and if one occurs during a 
>> java->native call, the
>> java thread making the call to native will block on return from the 
>> call to native until
>> the minor collection is over.  If the outliers are outliers in the 
>> time it takes to execute
>> the native call, and the minor gc duration and timing are exactly 
>> right, you could end
>> up spending most of the minor gc time blocked.  Pause time would then 
>> be 100ms
>> + minor gc time.
>>
>> Paul
>>
>> Shane Cox wrote:
>>> Ramki,
>>> We're running Solaris 10 on 8-core Intel Xeon's.  1 Java instance 
>>> per box.  JRE 1.6, update 14.  64-bit JVM.
>>>
>>> Our app is making JNI calls.  Each call requires approximately the 
>>> same amount of work (so we expect them to perform similarly).  
>>> Internally, we measure how long it takes to perform these calls. 
>>> +99% of these calls complete in less than 100 micros.  However, we 
>>> have outliers in the 300-1200ms range.  After some research, we have 
>>> found that these extreme outliers coincide with the Concurrent Mark 
>>> phase of CMS (based on timestamps), and ONLY if there is a minor GC 
>>> during that phase.
>>>
>>> In a given day, our app will execute CMS collections 500-1000 
>>> times.  Of these, less than 10 will have a minor collection execute 
>>> during the Concurrent Mark.  For each of these, our app reports a 
>>> large pause in the 300-1200ms range.  In fact, all of the large 
>>> pauses reported by our app correlate with a  minor GC during a 
>>> Concurrent Mark, without exception.
>>>
>>> Thanks
>>>
>>> On Mon, Sep 28, 2009 at 1:13 PM, <Y.S.Ramakrishna at sun.com 
>>> <mailto:Y.S.Ramakrishna at sun.com>> wrote:
>>>
>>>     What platform are you on (#cpu's etc.)
>>>     and when you say "app reports a pause of 300 ms"
>>>     is it that the odd transaction sees a latency
>>>     of 300 ms (coincident with concurrent mark),
>>>     whereas most transactions complete much more
>>>     quickly?
>>>
>>>     I am trying to first understand how you determine
>>>     that the application is seeing "long pauses"
>>>     when a minor gc occurs during concurrent mark.
>>>
>>>     PS: for example, if two gc pauses (say a scavenge of 30 ms
>>>     and an initial mark of 13 ms occur in quick succession,
>>>     your application might notice a  pause of 43 ms, etc.)
>>>
>>>     -- ramki
>>>
>>>
>>>     On 09/28/09 10:07, Shane Cox wrote:
>>>
>>>         Our application is reporting long pauses when a minor GC
>>>         occurs during the Concurrent Mark phase of CMS.  The output
>>>         below is a specific example.  All of the GC pauses are less
>>>         than 30ms (initial mark, remark, minor GC).  However, our app
>>>         reported a 300ms pause.
>>>
>>>         56750.934: [GC [1 CMS-initial-mark: 702464K(1402496K)]
>>>         719045K(1551616K), 0.0131859 secs]
>>>         56750.947: [CMS-concurrent-mark-start]
>>>         56752.133: [GC 56752.133: [ParNew: 144393K->12122K(149120K),
>>>         0.0237615 secs] 846857K->719330K(1551616K), 0.0239988 secs]
>>>         56752.162: [CMS-concurrent-mark: 1.188/1.215 secs]
>>>         56752.162: [CMS-concurrent-preclean-start]
>>>         56752.243: [CMS-concurrent-preclean: 0.070/0.081 secs]
>>>         56752.243: [CMS-concurrent-abortable-preclean-start]
>>>         56752.765: [CMS-concurrent-abortable-preclean: 0.143/0.522 
>>> secs]
>>>         56752.766: [GC[YG occupancy: 77423 K (149120 K)]56752.766:
>>>         [Rescan (parallel) , 0.0065730 secs]56752.773: [weak refs
>>>         processing, 0.0001983 secs] [1 CMS-remark: 707208K(1402496K)]
>>>         784631K(1551616K), 0.0068908 secs]
>>>         56752.773: [CMS-concurrent-sweep-start]
>>>         56753.209: [CMS-concurrent-sweep: 0.436/0.436 secs]
>>>         56753.209: [CMS-concurrent-reset-start]
>>>         56753.219: [CMS-concurrent-reset: 0.010/0.010 secs]
>>>
>>>
>>>         We only observe this behavior when a minor GC occurs during
>>>         the Concurrent Mark (which is rare).  Our app has reported
>>>         pauses up to 1.2 seconds ... which is generally the time it
>>>         takes to perform a Concurrent Mark.
>>>
>>>
>>>         Any insight/help that you could provide would be much 
>>> appreciated.
>>>
>>>         Thanks
>>>
>>>
>>>
>