CMS cycles triggered by Perm almost being full

Wed Oct 2 18:51:25 UTC 2013

On Wed, Oct 2, 2013 at 11:33 AM, Kirk Pepperdine <kirk at kodewerk.com> wrote:
> Hi Ramki,
>
> Comments inline.
>
>>>
>>> The IOF is set to 80% and tenured is not close to that value. Here is a
>>> typical record.
>>>
>>> 33.910: [GC [1 CMS-initial-mark: 187905K(360448K)] 218048K(507904K),
>>> 0.0121320 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>>
>>> So 19 of 30 is below the 80% mark.
>>
>>
>> That's around 50%.  This might be the very first CMS cycle. That cycle
>> is initiated
>
> It's actually not the first CMS cycle. It's early but CMS has run a number of cycles by this point in time.
>
>
>> "early" at a default setting of 50% i believe, to bootstrap the
>> statistics for the "automatic
>> trigger levelling" that CMS otherwise does.
>>
>> Also, as Mikael stated, if UseCMSInitiatingOccupancyOnly is not set,
>> the IOF becomes a high water
>> mark for triggering collections. If the autonomics decides it must
>> trigger a cycle earlier, it might do
>> so. I'd suggest looking at the verbose option that tells you why CMS
>> is initiating a cycle. Perhaps
>> as Mikael noted, the rate of promotion is deemed high enough that a
>> cycle needs to be started
>> sooner than you might otherwise expect.
>
> Right, there is no promotion happening because the app is 100% idle. There are no ParNew's happening, only back to back CMS cycles that last for about 5 seconds and are triggered after a 2 second pause. Hence the 7 second cycle time.
>
>>
>>
>>>
>>> (concurrent mode interrupted): 195585K->188751K(360448K), 1.1356885 secs]
>>> 321137K->188751K(507904K), [CMS Perm : 50152K->49867K(50240K)], 1.1360110
>>> secs] [Times: user=1.02 sys=0.01, real=1.14 secs]
>>>
>>> Shows that Perm is quite full. This CMF was a result of a System.gc() during
>>> the abortable-preclean.
>>
>> Was the perm gen expanded ? Could you include a more complete gc log
>> that shows the
>> issue?
>
> Unfortunately this is the *only* perm record available at the moment. The heap dump at the end of the log (of course) shows the same occupancy.
>>
>>>
>>> PrintFlagsFinal shows
>>>
>>>    uintx MaxPermHeapExpansion          = 5439488         {product}
>>>    uintx MaxPermSize                               = 85983232        {pd
>>> product}
>>>
>>> To Mikael's point, I looked at that bit of code this morning and it seemed
>>> that everything would work if CMSClassUnloadingEnabled was set. But we seem
>>> to degenerate into this endless cycle of CMS collections that can't solve
>>> the problem with CMSClassUnloadingEnabled is not set. Of course the answer
>>> is to set CMSClassUnloadingEnabled however... without it being set the
>>> behaviour is confusing for people to diagnose and it seems as if the JVM
>>> should call for a full gc with the reason being perm space needs to be
>>> resized but CMS can't do that for what ever reason.
>>
>> I  don't understand your point about "perm space needs to be resized".
>> That assumes
>> that perm occupancy is feeding into the decision to trigger a CMS
>> cycle but, as Mikael
>> pointed out, it isn't.
>
> Humm.... I need to pull 7.0 as I've only 8 and wouldn't you know it, 8 doesn't have any permspace code in it!

Ah, sorry, I wouldn't be familiar with the perm(metadatapsace)-related
triggers in 8.
There may have been something in the Q&A following Coleen's perm
removal talk at J1
perhaps. I haven't looked at the code,so  I'll let Jon/Mikael etc.
speak to that.

>
>>
>>>
>>> On the interesting distraction, I believe the 7.390 second cycle time is due
>>> to the constant unit of work that CMS has facing it since the JVM/app is
>>> idle. That 7.390 seconds includes ~2 seconds between reset and initial-mark.
>>
>> That seems reasonable. I think as Mikael pointed out the data on promotions
>> stops being updated once the app goes idle, so a sudden increase in
>> the promotion
>> rates just before the app went idle locks the decision into place for as long
>> as the stats aren't updated. My guess is that if a young gc were to occur, the
>> stats would be updated and CMS would snap out of it.
>
> Makes sense but how does this fit with what is happening. At the end of the cycle the collector decides when it needs to run again? And if it's wrong (ie promotion rates are higher than anticipated) the cycle can be triggered sooner?
>

Yes. The decision is made either at the end of a young gc or, if a
young gc doesn't happen soon enough, then the
CMS thread checks after a certain waiting period (CMSWaitDuration?)
whether it should run based on the triggering criteria.
Of course the input data to those criteria is stuck in the past
because no new samples on
promotion rate have been collected in the absence of minor gc's. So we
keep relying on
obsolete data and keep running periodically.

>>
>> So the real fix would be to have the stats decay even when no minor
>> gc's are occuring.
>> In fact, IIRC, there's a bug filed on this on CMS many years ago. It
>> took me a while to
>> remember.
>>
>> So Mikael is spot on when he suspected "I'm not sure how well that
>> code scales back if an application suddenly goes completely idle." I
>> don't think it does. There is no time-decay built into the ergonomics,
>> if no new data is collected, and if
>> there is no young gc we stay locked into an earlier decision. The fix
>> would be to decay that value with time (perhaps at each new CMS cycle
>> by looking at when the last sample was collected for promotion rate,
>> or something like that).
>
> Again, makes sense and seems reasonable....
>
> -- Kirk
>

-- ramki