How to alert for heap fragmentation

Todd Lipcon todd at cloudera.com
Tue Oct 16 14:39:18 PDT 2012


Hi Jesper,

Thanks for the links to those JEPs. From my perspective here:

JEP 158: unified GC logging is definitely appreciated, but it still
leaves us writing a parser, which is a bit inconvenient. We already
have a bunch of infrastructure to poll JMX for our Java processes, and
if fragmentation were exposed as a simple MBean attribute (the same way
we track committed heap and GC time, for example), that would be far
better IMO.
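
To make that concrete, the kind of polling we already do looks roughly
like the sketch below (untested); the commented-out lookup at the end is
purely hypothetical -- it's the sort of attribute we'd like to see added,
not something that exists today:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch of polling committed heap and GC time via the
// platform MXBeans, i.e. the metrics we already collect over JMX.
public class GcPoller {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap committed = " + heap.getCommitted());

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                + " time(ms)=" + gc.getCollectionTime());
        }

        // Hypothetical: an attribute along these lines on the CMS collector
        // bean (max free chunk size, total free space) is what we're asking
        // for. It does NOT exist today.
        // Object maxChunk = ManagementFactory.getPlatformMBeanServer().getAttribute(
        //     new javax.management.ObjectName(
        //         "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep"),
        //     "MaxFreeChunkSize");
    }
}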

JEP 167: we might make use of this, but its focus on events doesn't
seem to map directly to what we're doing. I suppose the FLS statistics
could be exposed as event properties after a collection, which would be
good enough, but it would still require coding to the JVMTI API, etc.,
versus the simple polling we already have a lot of tooling for.

So to summarize my thinking: everyone's got stuff that reads JMX
already, and the more you can add to the existing exposed interface,
the better.

Regarding G1, I did give it a try a year or so ago and ran into a lot
of bad behavior that caused it to do full GCs far more often than CMS
for our workload. I haven't tried it on the latest version, and I think
there were some changes around six months ago that were supposed to
address the issues I saw.

-Todd

On Fri, Oct 12, 2012 at 7:28 AM, Jesper Wilhelmsson
<jesper.wilhelmsson at oracle.com> wrote:
> Ramki, Todd,
>
> There are several projects in the pipeline for cleaning up verbose logs,
> reporting more/better data and improving the JVM monitoring infrastructure
> in different ways.
>
> Exactly what data we will add and which logging will be improved is not
> decided yet, but I wouldn't have too high hopes that CMS is first out. Our
> prime target for logging improvements lately has been G1 which, by the way,
> might be worthwhile checking out if you are worried about fragmentation.
>
> We have made some initial attempts along the lines of JEP 158 [1], again
> mainly for G1, and we are currently working on GC support for the
> event-based JVM tracing described in JEP 167 [2]. In the latter JEP the
> Parallel collectors (Parallel Scavenge and Parallel Old) will likely be
> first out with a few events. Have a look at these JEPs for more details.
>
> [1] http://openjdk.java.net/jeps/158
> [2] http://openjdk.java.net/jeps/167
>
> Best regards,
> /Jesper
>
>
> On 2012-10-12 08:30, Srinivas Ramakrishna wrote:
>>
>>
>> Todd, good question :-)
>>
>> @Jesper et al, do you know the answer to Todd's question? I agree that
>> exposing all of these stats via suitable JMX/MBean interfaces would be
>> quite useful.... The other possibility would be to log in the manner of
>> HP's gc logs (CSV format with suitable header), or jstat logs, so parsing
>> cost would be minimal. Then higher-level, general tools like Kafka could
>> consume the log/event streams, apply suitable filters and inform/alert
>> interested monitoring agents.
>>
>> @Todd & Saroj: Can you perhaps give some scenarios on how you might make
>> use of information such as this (more concretely, say, CMS fragmentation
>> at a specific JVM)? Would it be used only for "read-only" monitoring and
>> alerting, or do you see this as part of an automated data-centric control
>> system of sorts? The answer is kind of important, because something like
>> the latter can be accomplished today via gc log parsing (however kludgey
>> that might be) and something like Kafka/Zookeeper. On the other hand, I am
>> not sure if the latency of that kind of thing would fit well into a more
>> automated and fast-reacting data center control system or load-balancer,
>> where a more direct JMX/MBean-like interface might work better. Or was
>> your interest purely of the "development-debugging-performance-measurement"
>> kind, rather than about production JVMs? Anyway, thinking out loud here...
>>
>> Thoughts/Comments/Suggestions?
>> -- ramki
>>
>> On Thu, Oct 11, 2012 at 9:11 PM, Todd Lipcon <todd at cloudera.com> wrote:
>>
>>     Hey Ramki,
>>
>>     Do you know if there's any plan to offer the FLS statistics as a
>>     metric via JMX or some other interface in the future? It would be
>>     nice to be able to monitor fragmentation without having to actually
>>     log and parse the gc logs.
>>
>>     -Todd
>>
>>
>>     On Thu, Oct 11, 2012 at 7:50 PM, Srinivas Ramakrishna
>>     <ysr1729 at gmail.com> wrote:
>>
>>         In the absence of fragmentation, one would normally expect the
>>         max chunk size of the CMS generation to stabilize at some
>>         reasonable value, say after some 10's of CMS GC cycles. If it
>>         doesn't, you should try and use a larger heap, or otherwise
>>         reshape the heap to reduce promotion rates. In my experience,
>>         CMS seems to work best if its "duty cycle" is of the order of
>>         1-2%, i.e. there are 50 to 100 times more scavenges during the
>>         interval that it's not running vs the interval during which it
>>         is running.
>>
>>         Have Nagios grep the GC log file (with PrintFLSStatistics=2) for
>>         the string "Max  Chunk Size:" and pick the numeric component of
>>         every (4n+1)th match. The max chunk size will typically cycle
>>         within a small band, once it has stabilized, returning always to
>>         a high value following a CMS cycle's completion. If the upper
>>         envelope of this keeps steadily declining over some 10's of CMS
>>         GC cycles, then you are probably seeing fragmentation building
>>         up, and the heap will eventually succumb to it.
>>
>>         You can probably calibrate a threshold for the upper envelope so
>>         that if it falls below that threshold you will be alerted by
>>         Nagios that a closer look is in order.
>>
>>         At least something along those lines should work. The toughest
>>         part is designing your "filter" to detect the fall in the upper
>>         envelope. You will probably want to plot the metric, then see
>>         what kind of filter will detect the condition.... Sorry this
>>         isn't much concrete help, but hopefully it gives you some ideas
>>         for working in the right direction...
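>>
>>         As a very rough, untested sketch of such a check (the threshold
>>         and window below are just placeholders to be calibrated from
>>         your own plots; exit codes follow the usual Nagios convention,
>>         0=OK, 2=CRITICAL, 3=UNKNOWN):
>>
>>         import java.io.BufferedReader;
>>         import java.io.FileReader;
>>         import java.util.ArrayList;
>>         import java.util.List;
>>         import java.util.regex.Matcher;
>>         import java.util.regex.Pattern;
>>
>>         // Scan a PrintFLSStatistics=2 log, keep every (4n+1)th
>>         // "Max Chunk Size" value, and go critical if the recent
>>         // upper envelope drops below a hand-calibrated threshold.
>>         public class MaxChunkCheck {
>>             private static final Pattern MAX_CHUNK =
>>                 Pattern.compile("Max\\s+Chunk Size:\\s+(-?\\d+)");
>>
>>             public static void main(String[] args) throws Exception {
>>                 long threshold = Long.parseLong(args[1]); // calibrated by hand
>>                 List<Long> chunks = new ArrayList<Long>();
>>
>>                 try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
>>                     String line;
>>                     int match = 0;
>>                     while ((line = in.readLine()) != null) {
>>                         Matcher m = MAX_CHUNK.matcher(line);
>>                         if (m.find()) {
>>                             if (match % 4 == 0) {      // every (4n+1)th match
>>                                 chunks.add(Long.parseLong(m.group(1)));
>>                             }
>>                             match++;
>>                         }
>>                     }
>>                 }
>>
>>                 if (chunks.isEmpty()) {
>>                     System.out.println("UNKNOWN: no FLS statistics found");
>>                     System.exit(3);
>>                 }
>>
>>                 // Upper envelope over the last few CMS cycles (window is a guess).
>>                 int window = Math.min(10, chunks.size());
>>                 long envelope = 0;
>>                 for (int i = chunks.size() - window; i < chunks.size(); i++) {
>>                     envelope = Math.max(envelope, chunks.get(i));
>>                 }
>>
>>                 if (envelope < threshold) {
>>                     System.out.println("CRITICAL: max chunk size envelope = " + envelope);
>>                     System.exit(2);
>>                 }
>>                 System.out.println("OK: max chunk size envelope = " + envelope);
>>             }
>>         }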
>>
>>         -- ramki
>>
>>         On Thu, Oct 11, 2012 at 4:27 PM, roz dev <rozdev29 at gmail.com> wrote:
>>
>>             Hi All
>>
>>             I am using Java 6u23 with the CMS GC. I see that sometimes
>>             the application gets paused for a long time because of
>>             excessive heap fragmentation.
>>
>>             I have enabled the PrintFLSStatistics flag and the following
>>             is the log:
>>
>>
>>             2012-10-09T15:38:44.724-0400: 52404.306: [GC Before GC:
>>             Statistics for BinaryTreeDictionary:
>>             ------------------------------------
>>             Total Free Space: -668151027
>>             Max   Chunk Size: 1976112973
>>             Number of Blocks: 175445
>>             Av.  Block  Size: 20672
>>             Tree      Height: 78
>>             Before GC:
>>             Statistics for BinaryTreeDictionary:
>>             ------------------------------------
>>             Total Free Space: 10926
>>             Max   Chunk Size: 1660
>>             Number of Blocks: 22
>>             Av.  Block  Size: 496
>>             Tree      Height: 7
>>
>>
>>             I would like to know from people how they track heap
>>             fragmentation, and how we can alert on this situation.
>>
>>             We use Nagios and I am wondering if there is a way to parse
>>             these logs and extract the max chunk size so that we can
>>             alert on it.
>>
>>             Any inputs are welcome.
>>
>>             -Saroj
>>
>>     --
>>     Todd Lipcon
>>     Software Engineer, Cloudera
>>
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

