How to alert for heap fragmentation

Jesper Wilhelmsson jesper.wilhelmsson at oracle.com
Wed Oct 17 02:59:22 PDT 2012


Hi Todd,

I don't have a strong opinion on JMX vs other ways of getting data from the
VM. The serviceability team and other powers that be decide the policy here,
and the GC team mainly does as we are told.

G1 is constantly improving, and if it has been a while since you tried it, it
may be a good idea to try it again, especially if what you tried was before
7u4. 7u4 was a big step forward for G1 and the release in which we officially
started to support it.
/Jesper

On 2012-10-16 23:39, Todd Lipcon wrote:
> Hi Jesper,
>
> Thanks for the links to those JEPs. From my perspective here:
>
> JEP158: unifying GC logging is definitely appreciated, but it still
> leaves us writing a parser, which is a bit inconvenient. We already
> have a bunch of infrastructure to poll JMX for our Java processes, and
> if there were a simple MBean to track fragmentation (the same way we
> track committed heap and GC time, for example), that would be far
> better IMO.
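>
> Just to illustrate the style of polling I mean, here is a minimal
> in-process sketch using the standard platform MBeans (polling a remote
> process over a JMX connector looks much the same; the fragmentation
> metric itself is hypothetical and has no MBean today):
>
>     import java.lang.management.GarbageCollectorMXBean;
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.MemoryMXBean;
>
>     // Reads the metrics we already track (committed heap and
>     // cumulative GC counts/times) with no log parsing involved.
>     public class GcPoll {
>         public static void main(String[] args) {
>             MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
>             long committed = mem.getHeapMemoryUsage().getCommitted();
>             System.out.println("heap committed: " + committed + " bytes");
>
>             for (GarbageCollectorMXBean gc :
>                     ManagementFactory.getGarbageCollectorMXBeans()) {
>                 System.out.println(gc.getName()
>                     + " collections=" + gc.getCollectionCount()
>                     + " time=" + gc.getCollectionTime() + " ms");
>             }
>         }
>     }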
>
> JEP167: we might make use of this, but its focus on events doesn't
> seem to map directly to what we're doing. I guess the FLS statistics
> could be exposed as event properties after a collection, which would
> be good enough, but it would still require coding to the JVMTI API,
> etc., vs. just the simple polling we already have a lot of tooling for.
>
> So to summarize my thinking: everyone's got stuff that reads JMX
> already, and the more you can add to the existing exposed interface,
> the better.
>
> Regarding G1, I did give it a try a year or so ago and ran into a lot
> of bad behavior that caused it to full GC far more than CMS for our
> workload. I haven't given it a try on the latest, and I think there
> were some changes around 6 months ago which were supposed to address
> the issues I saw.
>
> -Todd
>
> On Fri, Oct 12, 2012 at 7:28 AM, Jesper Wilhelmsson
> <jesper.wilhelmsson at oracle.com> wrote:
>> Ramki, Todd,
>>
>> There are several projects in the pipeline for cleaning up verbose logs,
>> reporting more/better data and improving the JVM monitoring infrastructure
>> in different ways.
>>
>> Exactly what data we will add and what logging will be improved is not
>> decided yet, but I wouldn't have too high hopes that CMS is first out. Our
>> prime target for logging improvements lately has been G1 which, by the way,
>> might be worthwhile checking out if you are worried about fragmentation.
>>
>> We have done some initial attempts along the lines of JEP 158 [1], again
>> mainly for G1, and we are currently working with GC support for the
>> event-based JVM tracing described in JEP 167 [2]. In the latter JEP the
>> Parallel collectors (Parallel Scavenge and Parallel Old) will likely be
>> first out with a few events. Have a look at these JEPs for more details.
>>
>> [1] http://openjdk.java.net/jeps/158
>> [2] http://openjdk.java.net/jeps/167
>>
>> Best regards,
>> /Jesper
>>
>>
>> On 2012-10-12 08:30, Srinivas Ramakrishna wrote:
>>>
>>>
>>> Todd, good question :-)
>>>
>>> @Jesper et al, do you know the answer to Todd's question? I agree that
>>> exposing all of these stats via suitable JMX/MBean interfaces would be
>>> quite useful... The other possibility would be to log in the manner of
>>> HP's gc logs (CSV format with a suitable header), or jstat logs, so the
>>> parsing cost would be minimal. Then higher-level, general tools like
>>> Kafka could consume the log/event streams, apply suitable filters and
>>> inform/alert interested monitoring agents.
>>>
>>> @Todd & Saroj: Can you perhaps give some scenarios on how you might make
>>> use of information such as this (more concretely, say, CMS fragmentation
>>> at a specific JVM)? Would it be used only for "read-only" monitoring and
>>> alerting, or do you see this as part of an automated data-centric control
>>> system of sorts? The answer is kind of important, because something like
>>> the latter can be accomplished today via gc log parsing (however kludgey
>>> that might be) and something like Kafka/Zookeeper. On the other hand, I
>>> am not sure if the latency of that kind of thing would fit well into a
>>> more automated and fast-reacting data center control system or
>>> load-balancer, where a more direct JMX/MBean-like interface might work
>>> better. Or was your interest purely of the
>>> "development-debugging-performance-measurement" kind, rather than in
>>> production JVMs? Anyway, thinking out loud here...
>>>
>>> Thoughts/Comments/Suggestions?
>>> -- ramki
>>>
>>> On Thu, Oct 11, 2012 at 9:11 PM, Todd Lipcon <todd at cloudera.com> wrote:
>>>
>>>      Hey Ramki,
>>>
>>>      Do you know if there's any plan to offer the FLS statistics as a
>>>      metric via JMX or some other interface in the future? It would be
>>>      nice to be able to monitor fragmentation without having to
>>>      actually log and parse the gc logs.
>>>
>>>      -Todd
>>>
>>>
>>>      On Thu, Oct 11, 2012 at 7:50 PM, Srinivas Ramakrishna
>>>      <ysr1729 at gmail.com> wrote:
>>>
>>>          In the absence of fragmentation, one would normally expect the
>>>          max chunk size of the CMS generation to stabilize at some
>>>          reasonable value, say after some tens of CMS GC cycles. If it
>>>          doesn't, you should try to use a larger heap, or otherwise
>>>          reshape the heap to reduce promotion rates. In my experience,
>>>          CMS seems to work best if its "duty cycle" is of the order of
>>>          1-2%, i.e. there are 50 to 100 times more scavenges during the
>>>          interval that it's not running vs. the interval during which it
>>>          is running.
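>>>
>>>          (To put rough, made-up numbers on that: with a scavenge about
>>>          every second and a concurrent CMS cycle taking about 10
>>>          seconds, a 1-2% duty cycle works out to one CMS cycle every
>>>          500-1000 seconds, i.e. roughly 500-1000 scavenges between
>>>          cycles against the ~10 that overlap a running cycle.)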
>>>
>>>          Have Nagios grep the GC log file (with PrintFLSStatistics=2)
>>>          for the string "Max  Chunk Size:" and pick the numeric
>>>          component of every (4n+1)th match. The max chunk size will
>>>          typically cycle within a small band, once it has stabilized,
>>>          returning always to a high value following a CMS cycle's
>>>          completion. If the upper envelope of this keeps steadily
>>>          declining over some tens of CMS GC cycles, then you are
>>>          probably seeing fragmentation that will eventually end in a
>>>          promotion failure and a full, compacting GC.
>>>
>>>          You can probably calibrate a threshold for the upper envelope
>>>          so that if it falls below that threshold you will be alerted by
>>>          Nagios that a closer look is in order.
>>>
>>>          At least something along those lines should work. The toughest
>>>          part is designing your "filter" to detect the fall in the upper
>>>          envelope. You will probably want to plot the metric, then see
>>>          what kind of filter will detect the condition... Sorry this
>>>          isn't much concrete help, but hopefully it gives you some ideas
>>>          to work in the right direction...
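>>>
>>>          A rough, untested sketch of such a check (the default log
>>>          path, the 30-sample window and the threshold are made-up
>>>          values that would need calibrating against your own plots):
>>>
>>>          import java.nio.file.*;
>>>          import java.util.*;
>>>          import java.util.regex.*;
>>>
>>>          // Rough sketch of a Nagios-style check along the lines above.
>>>          // The default log path, the 30-sample window and the threshold
>>>          // are made-up values that need calibrating per application;
>>>          // values are in whatever units the FLS statistics report.
>>>          public class MaxChunkCheck {
>>>            static final Pattern P =
>>>                Pattern.compile("Max\\s+Chunk Size:\\s+(-?\\d+)");
>>>
>>>            public static void main(String[] args) throws Exception {
>>>              String log = args.length > 0 ? args[0] : "gc.log";
>>>              long threshold = args.length > 1
>>>                  ? Long.parseLong(args[1]) : 100_000_000L;
>>>              int window = 30;        // recent samples to consider
>>>
>>>              // Collect every (4n+1)th "Max Chunk Size" value.
>>>              List<Long> samples = new ArrayList<>();
>>>              int i = 0;
>>>              for (String line : Files.readAllLines(Paths.get(log))) {
>>>                Matcher m = P.matcher(line);
>>>                while (m.find()) {
>>>                  if (i % 4 == 0)
>>>                    samples.add(Long.parseLong(m.group(1)));
>>>                  i++;
>>>                }
>>>              }
>>>
>>>              // Upper envelope = max over the most recent samples.
>>>              long env = Long.MIN_VALUE;
>>>              int from = Math.max(0, samples.size() - window);
>>>              for (int j = from; j < samples.size(); j++)
>>>                env = Math.max(env, samples.get(j));
>>>
>>>              if (samples.isEmpty()) {
>>>                System.out.println("UNKNOWN - no FLS stats in " + log);
>>>                System.exit(3);       // Nagios UNKNOWN
>>>              } else if (env < threshold) {
>>>                System.out.println("CRITICAL - max chunk envelope " + env
>>>                    + " below threshold " + threshold);
>>>                System.exit(2);       // Nagios CRITICAL
>>>              } else {
>>>                System.out.println("OK - max chunk envelope " + env);
>>>                System.exit(0);       // Nagios OK
>>>              }
>>>            }
>>>          }
>>>
>>>          Run from cron or as a Nagios plugin against the live GC log;
>>>          the exit codes 0/2/3 follow the usual Nagios conventions.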
>>>
>>>          -- ramki
>>>
>>>          On Thu, Oct 11, 2012 at 4:27 PM, roz dev
>>>          <rozdev29 at gmail.com> wrote:
>>>
>>>              Hi All
>>>
>>>              I am using Java 6u23 with the CMS GC. I see that sometimes
>>>              the application gets paused for a long time because of
>>>              excessive heap fragmentation.
>>>
>>>              I have enabled the PrintFLSStatistics flag, and the
>>>              following is the log:
>>>
>>>
>>>              2012-10-09T15:38:44.724-0400: 52404.306: [GC Before GC:
>>>              Statistics for BinaryTreeDictionary:
>>>              ------------------------------------
>>>              Total Free Space: -668151027
>>>              Max   Chunk Size: 1976112973
>>>              Number of Blocks: 175445
>>>              Av.  Block  Size: 20672
>>>              Tree      Height: 78
>>>              Before GC:
>>>              Statistics for BinaryTreeDictionary:
>>>              ------------------------------------
>>>              Total Free Space: 10926
>>>              Max   Chunk Size: 1660
>>>              Number of Blocks: 22
>>>              Av.  Block  Size: 496
>>>              Tree      Height: 7
>>>
>>>
>>>              I would like to know how other people track heap
>>>              fragmentation, and how we can alert on this situation.
>>>
>>>              We use Nagios, and I am wondering if there is a way to
>>>              parse these logs and extract the max chunk size so that we
>>>              can alert on it.
>>>
>>>              Any inputs are welcome.
>>>
>>>              -Saroj
>>>
>>>      --
>>>      Todd Lipcon
>>>      Software Engineer, Cloudera
>>>
>>
>
>
>