How to alert for heap fragmentation

Thu Oct 11 23:30:58 PDT 2012

Todd, good question :-)

@Jesper et al, do you know the answer to Todd's question? I agree that
exposing all of these stats via suitable JMX/Mbean interfaces would be
quite useful.... The other possibility would be to log in the manner of
HP's gc logs (CSV format with suitable header), or jstat logs, so parsing
cost would be minimal. Then higher level, general tools like Kafka could
consume the log/event streams, apply suitable filters and inform/alert
interested monitoring agents.
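
As a rough illustration of the CSV idea, here is a minimal sketch that turns PrintFLSStatistics output into CSV rows with a header. It assumes each BinaryTreeDictionary block ends with its "Tree Height" line, as in the log excerpt quoted further down this thread. The function name fls_to_csv is made up for this example and is not part of any JDK tool.

```python
import csv
import io
import re

FIELDS = ["Total Free Space", "Max Chunk Size", "Number of Blocks",
          "Av. Block Size", "Tree Height"]

# Whitespace between the words varies in the log, so match it loosely.
STAT_RE = re.compile(
    r"(Total\s+Free\s+Space|Max\s+Chunk\s+Size|Number\s+of\s+Blocks|"
    r"Av\.\s+Block\s+Size|Tree\s+Height):\s*(-?\d+)"
)


def fls_to_csv(lines):
    """Return CSV text with one row per BinaryTreeDictionary block."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    row = {}
    for line in lines:
        match = STAT_RE.search(line)
        if not match:
            continue
        key = " ".join(match.group(1).split())  # collapse runs of spaces
        row[key] = int(match.group(2))
        if key == "Tree Height":  # assumed to be the last line of a block
            writer.writerow([row.get(field, "") for field in FIELDS])
            row = {}
    return out.getvalue()


sample = [
    "Statistics for BinaryTreeDictionary:",
    "Total Free Space: 10926",
    "Max   Chunk Size: 1660",
    "Number of Blocks: 22",
    "Av.  Block  Size: 496",
    "Tree      Height: 7",
]
print(fls_to_csv(sample))
```

A downstream consumer could then read the rows with any CSV reader instead of re-parsing the raw GC log.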

@Todd & Saroj: Can you perhaps give some scenarios on how you might make
use of information such as this (more concretely say CMS fragmentation at a
specific JVM)? Would it be used only for "read-only" monitoring and
alerting, or do you see this as part of an automated data-centric control
system of sorts? The answer is kind of important, because something like
the latter can be accomplished today via gc log parsing (however kludgey
that might be) and something like Kafka/Zookeeper. On the other hand, I am
not sure if the latency of that kind of thing would fit well into a more
automated and fast-reacting data center control system or load-balancer
where a more direct JMX/MBean like interface might work better. Or was your
interest purely of the "development-debugging-performance-measurement"
kind, rather than of production JVMs? Anyway, thinking out loud here...

Thoughts/Comments/Suggestions?
-- ramki

On Thu, Oct 11, 2012 at 9:11 PM, Todd Lipcon <todd at cloudera.com> wrote:

> Hey Ramki,
>
> Do you know if there's any plan to offer the FLS statistics as a metric
> via JMX or some other interface in the future? It would be nice to be able
> to monitor fragmentation without having to actually log and parse the gc
> logs.
>
> -Todd
>
>
> On Thu, Oct 11, 2012 at 7:50 PM, Srinivas Ramakrishna <ysr1729 at gmail.com> wrote:
>
>> In the absence of fragmentation, one would normally expect the max chunk
>> size of the CMS generation
>> to stabilize at some reasonable value, say after some 10's of CMS GC
>> cycles. If it doesn't, you should try
>> and use a larger heap, or otherwise reshape the heap to reduce promotion
>> rates. In my experience,
>> CMS seems to work best if its "duty cycle" is of the order of 1-2 %, i.e.
>> there are 50 to 100 times more
>> scavenges during the interval that it's not running vs the interval during
>> which it is running.
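
To make the arithmetic behind the 1-2 % figure explicit, here is a small sketch. It assumes scavenges happen at a roughly constant rate, so that scavenge counts are proportional to elapsed time. The function name is hypothetical.

```python
def duty_cycle_from_scavenges(scavenges_idle, scavenges_running):
    """Estimate the CMS duty cycle from scavenge counts.

    Assumes a roughly constant scavenge rate, so the fraction of
    scavenges that overlap a CMS cycle approximates the fraction of
    time CMS is running.
    """
    total = scavenges_idle + scavenges_running
    if total == 0:
        raise ValueError("no scavenges observed")
    return scavenges_running / total


for ratio in (50, 100):
    cycle = duty_cycle_from_scavenges(ratio, 1)
    print(f"{ratio}x more scavenges while idle: duty cycle ~ {cycle:.2%}")
```

With 50 and 100 times more scavenges while CMS is idle, this gives roughly 2 % and 1 % respectively.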
>>
>> Have Nagios grep the GC log file w/PrintFLSStatistics=2 for the string
>> "Max   Chunk Size:" and pick the
>> numeric component of every (4n+1)th match. The max chunk size will
>> typically cycle within a small band,
>> once it has stabilized, returning always to a high value following a CMS
>> cycle's completion. If the upper envelope
>> of this keeps steadily declining over some 10's of CMS GC cycles, then
>> you are probably seeing a heap
>> that will eventually succumb to fragmentation.
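
The extraction step described above could be sketched as follows. This is only an illustration: it assumes four "Max Chunk Size" lines are printed per GC event, so that every (4n+1)th match belongs to the CMS generation before a GC, and the function name is made up for this example.

```python
import re

# Whitespace between the words varies in the log, so match it loosely.
MAX_CHUNK_RE = re.compile(r"Max\s+Chunk\s+Size:\s*(-?\d+)")


def cms_max_chunk_sizes(lines, stride=4):
    """Return the numeric part of every (stride*n + 1)th match.

    stride=4 is an assumption about how many 'Max Chunk Size' lines
    each GC event prints; adjust it to match your own logs.
    """
    values = [int(m.group(1))
              for line in lines
              for m in MAX_CHUNK_RE.finditer(line)]
    return values[::stride]  # matches 1, 5, 9, ... in 1-based counting


log = [
    "Max   Chunk Size: 1976112973",
    "Max   Chunk Size: 1660",
    "Max   Chunk Size: 1976000000",
    "Max   Chunk Size: 1660",
    "Max   Chunk Size: 1975000000",
]
print(cms_max_chunk_sizes(log))
```

In practice the lines would come from the GC log file, for example by iterating over an open file object.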
>>
>> You can probably calibrate a threshold for the upper envelope so that if
>> it falls below that threshold you will
>> be alerted by Nagios that a closer look is in order.
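
One possible shape for such a threshold check is sketched below. The window size and the warn/crit thresholds are placeholders to be calibrated against your own data, and the function names are hypothetical. The returned codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
def upper_envelope(samples, window):
    """Max of each full, non-overlapping window of samples.

    A trailing partial window is ignored so that an incomplete cycle
    does not produce a spuriously low value.
    """
    return [max(samples[i:i + window])
            for i in range(0, len(samples) - window + 1, window)]


def check_envelope(samples, window, warn, crit):
    """Return (nagios_exit_code, message) for the latest envelope value."""
    envelope = upper_envelope(samples, window)
    if not envelope:
        return 3, "UNKNOWN - not enough samples"
    latest = envelope[-1]
    if latest < crit:
        return 2, f"CRITICAL - max chunk envelope {latest} < {crit}"
    if latest < warn:
        return 1, f"WARNING - max chunk envelope {latest} < {warn}"
    return 0, f"OK - max chunk envelope {latest}"


# Made-up samples: the envelope falls from 900 to 400 across three windows.
samples = [900, 700, 500, 800, 600, 450, 400, 300, 250]
print(check_envelope(samples, window=3, warn=600, crit=300))
```

A real plugin would print the message and pass the code to sys.exit() so that Nagios picks up the status.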
>>
>> At least something along those lines should work. The toughest part is
>> designing your "filter" to detect the
>> fall in the upper envelope. You will probably want to plot the metric,
>> then see what kind of filter will detect
>> the condition.... Sorry this isn't much concrete help, but hopefully it
>> gives you some ideas to work in
>> the right direction...
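
As one candidate for such a filter, here is a sketch that fits a least-squares line to the upper envelope and flags a sustained relative decline. The minimum number of points and the 1 % per-cycle drop are arbitrary assumptions chosen for illustration, not recommended values, and other filters may suit your data better.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_steadily_declining(envelope, min_points=10, rel_drop_per_cycle=0.01):
    """True if the envelope trends down by at least rel_drop_per_cycle
    (as a fraction of its mean) per point, over at least min_points points."""
    if len(envelope) < min_points:
        return False
    mean = sum(envelope) / len(envelope)
    if mean <= 0:
        return False
    return slope(envelope) / mean <= -rel_drop_per_cycle


declining = [1000 - 20 * i for i in range(12)]  # made-up envelope values
stable = [1000] * 12
print(is_steadily_declining(declining), is_steadily_declining(stable))
```

Plotting the envelope alongside the fitted line is a quick way to judge whether the chosen parameters match what you see by eye.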
>>
>> -- ramki
>>
>> On Thu, Oct 11, 2012 at 4:27 PM, roz dev <rozdev29 at gmail.com> wrote:
>>
>>> Hi All
>>>
>>> I am using Java 6u23 with CMS GC. I see that the application is sometimes
>>> paused for a long time because of excessive heap fragmentation.
>>>
>>> I have enabled the PrintFLSStatistics flag, and the following is the log:
>>>
>>>
>>> 2012-10-09T15:38:44.724-0400: 52404.306: [GC Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: -668151027
>>> Max   Chunk Size: 1976112973
>>> Number of Blocks: 175445
>>> Av.  Block  Size: 20672
>>> Tree      Height: 78
>>> Before GC:
>>> Statistics for BinaryTreeDictionary:
>>> ------------------------------------
>>> Total Free Space: 10926
>>> Max   Chunk Size: 1660
>>> Number of Blocks: 22
>>> Av.  Block  Size: 496
>>> Tree      Height: 7
>>>
>>>
>>> I would like to know how people track heap fragmentation and how
>>> to alert on this situation.
>>>
>>> We use Nagios and I am wondering if there is a way to parse these logs
>>> and know the max chunk size so that we can alert for it.
>>>
>>> Any inputs are welcome.
>>>
>>> -Saroj
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>