Low-Overhead Heap Profiling

Tue Jun 23 23:21:51 UTC 2015

I don't want the size of the TLAB, which is ergonomically adjusted, to be
tied to the sampling rate.  There is no reason to do that.  I want
reasonable statistical sampling of the allocations.

All this requires is a separate counter that is set to the next sampling
interval and decremented on each allocation; when the counter reaches 0,
the allocation goes into a slow path.  Doing a subtraction and a pointer
bump in allocation instead of just a pointer bump is basically free.  Note
that the allocation path has been doing an additional addition (to keep
track of per-thread allocation) since Java 7, and no one has complained.
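
To make the counter idea concrete, here is a minimal sketch in Java.  It
is a toy model, not HotSpot code: the class SamplingBumpAllocator, its
fields, and the fixed (non-randomized) interval are all assumptions made
for illustration.  The point is only that the fast path adds one
subtraction and one test to the pointer bump, and that the countdown is
independent of the buffer size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a thread-local bump allocator; not HotSpot code.
final class SamplingBumpAllocator {
    private final long end;               // end of the buffer (think: TLAB end)
    private final long samplingInterval;  // bytes between samples (assumed fixed)
    private long top = 0;                 // next free offset
    private long bytesUntilSample;        // separate countdown, unrelated to buffer size
    private final List<Long> sampledOffsets = new ArrayList<>();

    SamplingBumpAllocator(long capacity, long samplingInterval) {
        this.end = capacity;
        this.samplingInterval = samplingInterval;
        this.bytesUntilSample = samplingInterval;
    }

    // Returns the offset of the new object, or -1 if the buffer is full.
    long allocate(long size) {
        if (size > end - top) {
            return -1;                    // a real VM would refill the buffer here
        }
        long result = top;
        top += size;                      // the usual pointer bump
        bytesUntilSample -= size;         // the one extra subtraction
        if (bytesUntilSample <= 0) {
            sampleSlowPath(result);       // rare: only once per interval
        }
        return result;
    }

    // Slow path: record the sample and re-arm the countdown.
    private void sampleSlowPath(long offset) {
        sampledOffsets.add(offset);       // a real VM would grab a stack trace here
        bytesUntilSample = samplingInterval;
    }

    List<Long> sampledOffsets() {
        return sampledOffsets;
    }
}
```

With an interval of 100 bytes and 30-byte allocations, every fourth
allocation takes the slow path, whatever the buffer capacity is.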

I'm not worried about the ease of implementation here, because we've
already implemented it.  It hasn't even been hard for us to do the forward
port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling
interval to 2^32, have the sampling code do nothing, and no one will ever
notice.  In fact, we could just have the sampling code do nothing, and no
one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8
made it more expensive to grab a stack trace (the cost became proportional
to the number of loaded classes), but we have a patch that mitigates that,
which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback
mechanism is that there is quite a lot you can't do from user code during
an allocation, because of lack of access to JNI.  However, you can do
pretty much anything from the VM itself.  Crucially (for us), we don't just
log the stack traces, we also keep track of which are live and which
aren't.  We can't do this in a callback, if the callback can't create weak
refs to the object.

What we do at Google is to have two methods: one that you pass a callback
to (the callback gets invoked with a StackTraceData object, as I've defined
above), and another that just tells you which sampled objects are still
live.  We could also add a third, which would allow a callback to set the
sampling interval (basically, the VM would call it to get the integer
number of bytes to be allocated before the next sample).
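
Roughly, the shape of those three entry points could be sketched as
follows.  This is an illustration in Java, not our actual interface:
HeapSamplerSketch, Sample (a stand-in for StackTraceData, whose real
fields are not reproduced here), and every method name are hypothetical,
and recordAllocation only simulates the hook the VM would call on its
allocation path.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of the three entry points; all names are assumptions.
final class HeapSamplerSketch {
    // Stand-in for StackTraceData: a stack trace plus a weak ref to the object.
    static final class Sample {
        final StackTraceElement[] frames;
        final long size;
        final WeakReference<Object> ref;

        Sample(StackTraceElement[] frames, long size, Object obj) {
            this.frames = frames;
            this.size = size;
            this.ref = new WeakReference<>(obj);
        }
    }

    private final List<Sample> samples = new ArrayList<>();
    private Consumer<Sample> callback = s -> { };
    private LongSupplier intervalSupplier = () -> 512L * 1024L; // assumed default
    private long bytesUntilSample = intervalSupplier.getAsLong();

    // 1. Register a callback that is invoked for every sampled allocation.
    void setSampleCallback(Consumer<Sample> cb) {
        this.callback = cb;
    }

    // 2. Report which sampled objects are still live (weak ref not cleared).
    List<Sample> liveSamples() {
        List<Sample> live = new ArrayList<>();
        for (Sample s : samples) {
            if (s.ref.get() != null) {
                live.add(s);
            }
        }
        return live;
    }

    // 3. Let a callback choose the number of bytes until the next sample.
    void setIntervalSupplier(LongSupplier supplier) {
        this.intervalSupplier = supplier;
        this.bytesUntilSample = supplier.getAsLong();
    }

    // Simulates the VM-side hook: called once per allocation.
    void recordAllocation(Object obj, long size) {
        bytesUntilSample -= size;
        if (bytesUntilSample <= 0) {
            Sample s = new Sample(new Throwable().getStackTrace(), size, obj);
            samples.add(s);
            callback.accept(s);
            bytesUntilSample = intervalSupplier.getAsLong();
        }
    }
}
```

The weak reference is the piece that a pure user-code callback cannot
easily supply, which is why the liveness query sits next to the callback
rather than being built on top of it.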

Would people be amenable to that?  It makes the code more complex, but, as
I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB
object come from?").

Jeremy

On Tue, Jun 23, 2015 at 1:06 PM, Tony Printezis <tprintezis at twitter.com>
wrote:

> Jeremy (and all),
>
> I’m not on the serviceability list so I won’t include the messages so far.
> :-) Also CCing the hotspot GC list, in case they have some feedback on this.
>
> Could I suggest a (much) simpler but at least as powerful and flexible way
> to do this? (This is something we’ve been meaning to do for a while now for
> TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force
> allocations to go into the slow path periodically by artificially setting
> the TLAB top to a lower value. So, imagine a TLAB is 4M. You can set top to
> (bottom+1M). When an allocation thinks the TLAB is full (in this case, the
> first 1MB is full) it will call the allocation slow path. There, you can
> intercept it, sample the allocation (and, like in your case, you’ll also
> have the correct stack trace), notice that the TLAB is not actually full,
> extend its top to, say, (bottom+2M), and you’re done.
>
> Advantages of this approach:
>
> * This is a much smaller, simpler, and self-contained change (no compiler
> changes necessary to maintain...).
>
> * When it’s off, the overhead is only one extra test at the slow path TLAB
> allocation (i.e., negligible; we do some sampling on TLABs in TwitterJDK
> using a similar mechanism and, when it’s off, I’ve observed no performance
> overhead).
>
> * (most importantly) You can turn this on and off, and adjust the sampling
> rate, dynamically. If you do the sampling based on JITed code, you’ll have
> to recompile all methods with allocation sites to turn the sampling on or
> off. (You can of course have it always on and just discard the output; it’d
> be nice not to have to do that though. IMHO, at least.)
>
> * You can also very cheaply turn this on and off (or adjust the sampling
> frequency) per thread, if that’d be helpful in some way (just add the
> appropriate info on the thread’s TLAB).
>
> A few extra comments on the previous discussion:
>
> * "JFR samples per new TLAB allocation. It provides really very good
> picture and I haven't seen overhead more than 2” : When TLABs get very
> large, I don’t think sampling one object per TLAB is enough to get a good
> sample (IMHO, at least). It’s probably OK for something like jbb which
> mostly allocates instances of a handful of classes and has very few
> allocation sites. But, a lot of the code we run at Twitter is a lot more
> elaborate than that and, in our experience, sampling one object per TLAB is
> not enough. You can, of course, decrease the TLAB size to increase the
> sampling size. But it’d be good not to have to do that given a smaller TLAB
> size could increase contention across threads.
>
> * "Should it *just* take a stack trace, or should the behavior be
> configurable?” : I think we’d have to separate the allocation sampling
> mechanism from the consumption of the allocation samples. Once the sampling
> mechanism is in, different JVMs can take advantage of it in different ways.
> I assume that the Oracle folks would like at least a JFR event for every
> such sample. But in your build you can add extra code to collect the
> information in the way you have now.
>
> * Talking of JFR, it’s a bit unfortunate that the AllocObjectInNewTLAB
> event has both the new TLAB information and the allocation information. It
> would have been nice if that event was split into two, say NewTLAB and
> AllocObjectInTLAB, and we’d be able to fire the latter for each sample.
>
> * "Should the interval between samples be configurable?” : Totally. In
> fact, it’d be helpful if it was configurable dynamically. Imagine if a JVM
> starts misbehaving after 2-3 weeks of running. You can dynamically increase
> the sampling rate to get a better profile if the default is not giving
> fine-grain enough information.
>
> * "As long of these features don’t contribute to sampling bias” : If the
> sampling interval is fixed, sampling bias would be a very real concern. In
> the above example, I’d increment top by 1M (the sampling frequency) + p% (a
> fudge factor).
>
> * "Yes, a perhaps optional callbacks would be nice too.” : Oh, no. :-)
> But, as I said, we should definitely separate the sampling mechanism from
> the mechanism that consumes the samples.
>
> * "Another problem with our submitting things is that we can't really test
> on anything other than Linux.” : Another reason to go with as
> platform-independent a solution as possible. :-)
>
> Regards,
>
> Tony
>
> -----
>
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>
> @TonyPrintezis
> tprintezis at twitter.com
>
>