Trace object allocation flag proposal

Thu Dec 16 11:26:57 PST 2010

Hi David,

Please see my comments below.

On Wed, Dec 15, 2010 at 8:47 PM, David Holmes <David.Holmes at oracle.com> wrote:

> Xiaobin Lu said the following on 12/16/10 04:00:
>
>  Thanks for your feedback, David.
>>
>> Let me try to clarify with some more background information.
>>
>> One of the problems many applications have is the object allocation problem
>> (Duh ...). Over releases, GC becomes more and more frequent. It is pretty
>> hard for one single team to find out where things go wrong. The existing
>> tools such as jhat or jmap are very hard to use with a giant heap dump
>> file. So the flag I propose is mostly for diagnostic purposes; it won't be
>> enabled by default in production. Having said that, if we make it a
>> manageable flag, we can turn it on when GC goes wild.
>>
>
> I don't see this flag as solving the problem. If the heap is filled by a
> million object allocations then you will have a million trace records to
> process - as difficult as (and less informative than) processing a heap
> dump.
>

The benefit of making this flag manageable is that we could turn it on and
off when GC becomes abnormal. GC behaviour can be observed programmatically
through JMX, and JMX can then be used again to turn the flag on. It would be
very easy for someone to put automation in place to trace object allocation
on demand.
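
Roughly what I have in mind for the automation, as a minimal sketch only. It
uses the standard GarbageCollectorMXBean and HotSpotDiagnosticMXBean
interfaces. The flag name "TraceObjectAllocation" is a placeholder I made up
for the proposed flag (no such flag exists in a released JVM), and the
threshold of one collection per second is an arbitrary assumption.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class AllocationTraceToggle {
    // Hypothetical name for the proposed flag; not a real JVM option.
    static final String PROPOSED_FLAG = "TraceObjectAllocation";

    /** Total number of collections reported by all collectors so far. */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sets a boolean manageable flag; returns false if it is unknown or not writeable. */
    public static boolean setManageableFlag(String name, boolean on) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            VMOption option = bean.getVMOption(name); // throws if the flag does not exist
            if (!option.isWriteable()) {
                return false;
            }
            bean.setVMOption(name, Boolean.toString(on));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcCount();
        Thread.sleep(1000);
        long collectionsPerSecond = totalGcCount() - before;

        // Assumed threshold: treat one or more collections per second as "abnormal".
        boolean abnormal = collectionsPerSecond >= 1;
        boolean applied = setManageableFlag(PROPOSED_FLAG, abnormal);
        System.out.println("collections/s=" + collectionsPerSecond
                + " abnormal=" + abnormal + " flagApplied=" + applied);
    }
}
```

On a current JVM the last step simply reports flagApplied=false, since the
flag is only a proposal. The same code could be pointed at a remote process
through a JMX connector instead of the local platform MBean server.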

Having said that, I agree with your statement here. Perhaps we should provide
type filtering so that we don't get more records than we can digest. What are
your thoughts on this?
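
To illustrate the kind of type filtering I mean, here is a small sketch that
filters and sums trace records after the fact. It assumes each record is one
line of the form "<type> <size in bytes>"; that format is my assumption for
illustration, not a settled output format, and the class and method names are
made up.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AllocationTraceFilter {
    /**
     * Sums allocated bytes per type, keeping only records whose type name
     * starts with the given prefix. Malformed lines are skipped.
     */
    public static Map<String, Long> bytesByType(List<String> lines, String typePrefix) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2 || !parts[0].startsWith(typePrefix)) {
                continue; // not a "<type> <size>" record, or filtered out by type
            }
            long size;
            try {
                size = Long.parseLong(parts[1]);
            } catch (NumberFormatException e) {
                continue; // size field is not a number
            }
            totals.merge(parts[0], size, Long::sum);
        }
        return totals;
    }
}
```

Doing the same filtering inside the VM, before the record is printed, is what
would actually cut the volume of output; the sketch only shows the selection
and aggregation logic.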

>
> But just by turning on this tracing you will add so much overhead to the
> allocation (potentially) that you will likely completely change the
> allocation pattern and may allow GC to more easily keep up. Any mechanism
> has the potential to do this - even DTrace probes are not as unobtrusive as
> people make out when there are large quantities of events occurring. The key
> to success in these situations is having a means to identify the
> "interesting" events so that you only trace those.

The overhead is not as much as we might think; it is just doing a printf to
display the allocation type and size. I had this flag turned on with one
workload here and the added overhead was tolerable. I will post more data
here.

The concern about affecting the allocation pattern should probably take
second place to having a way to diagnose the allocation problem. Besides, we
don't know how much impact tracing will have on the allocation pattern. Or
do we?

>
>
>  Another value this flag could provide is to find out why OOM sometimes
>> occurs. For example, someone who wrote a search application on top of the
>> Lucene framework suddenly hits a crash due to OOM. The stack trace points
>> to Lucene code which is hard to instrument, so this flag could provide
>> insight into why the allocation fails. Maybe it is due to a corrupted
>> length value passed to "new byte[length];" etc.
>>
>
> A flag that enables you to track "unusual" allocations would seem of more
> use to detect this kind of condition.

I imagine that would be a really hard problem to attack.

>
>
>
>  I like your idea of a per-thread basis; however, for a lot of web
>> applications, threads come and go. It is pretty hard to pinpoint which
>> thread I want to trace object allocation on.
>>
>> To answer your last question, what I am really interested in finding out
>> is what type of object allocation makes GC happen more frequently.
>>
>
> Most GC's don't know what kind of garbage they are collecting, but I wonder
> if G1 might be of more assistance here?
>
>
>  Randomly taking snapshots of the heap using jmap is absolutely not an ideal
>> way to do so: not only does it generate a giant file that is difficult to
>> work with, it also pauses the application and hurts application
>> throughput.
>>
>
> So will excessive tracing as per the above. Honestly I think the only real
> way to diagnose such an issue is to let the system get into the problematic
> state, take it offline, take a snapshot and work from that.
>

Most of the time when GC goes wild, we have to take the system offline since
we have no choice. We had one appserver here doing a full GC every second. It
did nothing but collect garbage :-). Having a way to understand the
allocation is critical to solving this kind of production issue.

Taking a snapshot is OK, but analysing the heap is a pain in the neck. So far
I have had no luck getting jhat to work with a 2G heap dump, even with
-Xmx12g specified (maybe I should fix that problem instead; John Rose
suggested a while ago that I file a bug. Sorry for taking so long.).

Thanks so much for the feedback,

-Xiaobin