Low-Overhead Heap Profiling

Jeremy Manson jeremymanson at google.com
Wed Jun 24 23:26:35 UTC 2015


On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <tprintezis at twitter.com>
wrote:

> Hi Jeremy,
>
> Please see inline.
>
> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremymanson at google.com)
> wrote:
>
> I don't want the size of the TLAB, which is ergonomically adjusted, to be
> tied to the sampling rate.  There is no reason to do that.  I want
> reasonable statistical sampling of the allocations.
>
>
> As I said explicitly in my e-mail, I totally agree with this. Which is why
> I never suggested resizing TLABs in order to vary the sampling rate.
> (Apologies if my e-mail was not clear.)
>

My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocations
entirely (and, in fact, not work if TLAB support is turned off)?  I might
be missing something obvious (and see my response below).


> All this requires is a separate counter that is set to the next sampling
> interval, and decremented when an allocation happens, which goes into a
> slow path when the decrement hits 0.  Doing a subtraction and a pointer
> bump in allocation instead of just a pointer bump is basically free.
>
>
> Maybe on Intel it’s cheap, but maybe it’s not on other platforms that other
> folks care about.
>
Really?  A memory read and a subtraction?  Which architectures care about
that?
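
To make the cost concrete, here's a minimal sketch of the bookkeeping in
plain Java.  This is illustration only - in the real patch the logic is
emitted by the template interpreter and the JITs in the allocation fast
path - and every name in it is invented:

  // Illustrative model of a bump-the-pointer allocation with a byte
  // counter for sampling: one pointer bump plus one subtraction.
  final class SampledAllocatorModel {
      private static final long SAMPLING_INTERVAL = 512 * 1024;  // invented

      private long tlabTop;    // current TLAB bump pointer
      private long tlabEnd;    // end of the current TLAB
      private long bytesUntilSample = SAMPLING_INTERVAL;

      long allocate(long size) {
          bytesUntilSample -= size;      // the subtraction under debate
          long newTop = tlabTop + size;  // the bump we already pay for
          if (newTop > tlabEnd || bytesUntilSample <= 0) {
              return allocateSlowPath(size);
          }
          long obj = tlabTop;
          tlabTop = newTop;
          return obj;
      }

      private long allocateSlowPath(long size) {
          if (bytesUntilSample <= 0) {
              recordSample(size);                    // capture a stack trace
              bytesUntilSample = SAMPLING_INTERVAL;  // arm the next sample
          }
          if (tlabTop + size > tlabEnd) {
              refillTlab(4 * 1024 * 1024);  // invented size; assume it fits
          }
          long obj = tlabTop;
          tlabTop += size;
          return obj;
      }

      private void recordSample(long size) {
          // In the real thing: record a stack trace for this allocation.
      }

      private void refillTlab(long tlabSize) {
          tlabTop = 0;  // stand-in for getting a new TLAB from the heap
          tlabEnd = tlabTop + tlabSize;
          // bytesUntilSample is deliberately not reset here, so the
          // interval carries across TLAB boundaries.
      }
  }

The fast path pays one subtraction and one compare on a thread-local
counter; everything interesting is pushed onto the slow path we were going
to take anyway.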

Again, notice that no one has complained about the addition that was
introduced to track total bytes allocated per thread.  Note that it was
actually added in the 6u20 timeframe.

> Note that it has been doing an additional addition (to keep track of per
> thread allocation) as part of allocation since Java 7,
>
>
> Interesting. I hadn’t realized that. Does that keep track of total size
> allocated per thread or number of allocated objects per thread? If it’s the
> former, why isn’t it possible to calculate that from the TLAB information?
>

Total size allocated per thread.  It isn't possible to calculate that from
the TLAB because of out-of-TLAB allocation (and hypothetically disabled
TLABs).

For some reason, they never included it in the ThreadMXBean interface, but
it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean
to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on
it.
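
For example, something like this works on HotSpot, where the platform
ThreadMXBean implementation also implements the com.sun.management
subinterface:

  import java.lang.management.ManagementFactory;

  public class AllocatedBytesDemo {
      public static void main(String[] args) {
          // On HotSpot, the platform ThreadMXBean can be cast to the
          // com.sun.management subinterface, which exposes the counter.
          com.sun.management.ThreadMXBean bean =
              (com.sun.management.ThreadMXBean)
                  ManagementFactory.getThreadMXBean();
          long tid = Thread.currentThread().getId();
          byte[] data = new byte[1024 * 1024];  // allocate something
          System.out.println("bytes allocated by this thread: "
              + bean.getThreadAllocatedBytes(tid));
      }
  }

Note the counter includes out-of-TLAB allocations, which is exactly why
the TLAB data alone can't reproduce it.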


> and no one has complained.
>
> I'm not worried about the ease of implementation here, because we've
> already implemented it.
>
>
> Yeah, but someone will have to maintain it moving forward.
>

I've been maintaining it internally at Google for 5 years.  It's actually
pretty self-contained.  The only work involved is when they refactor
something (so I've had to move it), or when a bug in the existing
implementation is discovered.  It is very closely parallel to the TLAB
code, which changes little, if at all.


> It hasn't even been hard for us to do the forward port, except when the
> relevant Hotspot code is significantly refactored.
>
> We can also turn the sampling off, if we want.  We can set the sampling
> rate to 2^32, have the sampling code do nothing, and no one will ever
> notice.
>
>
> You still have extra instructions in the allocation path, so it’s not
> turned off (i.e., you have the tax without any benefit).
>
>
Hey, you have a counter in your allocation path you've never noticed, which
none of your code uses.  Pipelining is a wonderful thing.  :)

> In fact, we could just have the sampling code do nothing, and no one would
> ever notice.
>
> Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8
> made it more expensive to grab a stack trace (the cost became proportional
> to the number of loaded classes), but we have a patch that mitigates that,
> which we would also be happy to upstream.
>
> As for the other concern: my concern about *just* having the callback
> mechanism is that there is quite a lot you can't do from user code during
> an allocation, because of lack of access to JNI.
>
>
> Maybe I missed something. Are the callbacks in Java? I.e., do you call
> them using JNI from the slow path you call directly from the allocation
> code?
>
(For context: this referred to the hypothetical feature where we can
provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but
let's call it hypothetical for the moment.)

We invoke native code.  You can't invoke any Java code during allocation,
including calling back into Java via JNI, because that would make
allocation potentially reentrant, which doesn't work for all sorts of
reasons.  The native code doesn't even get passed a JNIEnv * - there is
nothing it could do with one without making the VM crash a lot.

Or, rather, you might be able to do that, but it would take a lot of
Hotspot rearchitecting.  When I tried to do it, I realized it would be an
extremely deep dive.

> However, you can do pretty much anything from the VM itself.  Crucially
> (for us), we don't just log the stack traces, we also keep track of which
> are live and which aren't.  We can't do this in a callback, if the callback
> can't create weak refs to the object.
>
> What we do at Google is to have two methods: one that you pass a callback
> to (the callback gets invoked with a StackTraceData object, as I've defined
> above), and another that just tells you which sampled objects are still
> live.  We could also add a third, which allowed a callback to set the
> sampling interval (basically, the VM would call it to get the integer
> number of bytes to be allocated before the next sample).
>
> Would people be amenable to that?  It makes the code more complex, but, as
> I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB
> object come from?").
>
>
> Well, that 1GB object would have most likely been allocated outside a TLAB
> and you could have identified it by instrumenting the “outside-of-TLAB
> allocation path” (just saying…).
>

That's orthogonal to the point I was making in the quote above - the point
I was making there was that we want to be able to detect what sampled
objects are live.  We can do that regardless of how we implement the
sampling (although it did involve my making a new kind of weak oop
processing mechanism inside the VM).
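
To make the shape of that API concrete, here is a hypothetical sketch.
The real callbacks are native code, as discussed above, so plain Java is
used purely to show the surface, and every name here is invented:

  // Hypothetical sketch of the two methods (plus the optional third)
  // described above.  Not the actual interface.
  interface SampleCallback {
      // Method 1's callback: invoked by the VM with one sample's data.
      void sampled(StackTraceData data);
  }

  // Stand-in for the StackTraceData object mentioned above: the stack
  // captured at allocation time plus some metadata.
  class StackTraceData {
      StackTraceElement[] frames;
      long sizeInBytes;
  }

  interface HeapSamplingApi {
      // Method 1: register a callback to receive each sampled allocation.
      void setSampleCallback(SampleCallback cb);

      // Method 2: which previously sampled objects are still live; the
      // weak-oop machinery mentioned above is what answers this.
      StackTraceData[] getLiveSamples();

      // Possible method 3: bytes to allocate before the next sample.
      void setSamplingInterval(long bytes);
  }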

But to the question of whether we can just instrument the outside-of-TLAB
allocation path...  There are a few weirdnesses here.  The first one that
jumps to mind is that there's also a fast path for allocating in the young
generation outside of TLABs, when an object is too large to fit in the
current TLAB.  Those objects would never get sampled.  So "outside of
TLAB" doesn't always mean "slow path".

Another one that jumps to mind is that we don't know whether the
outside-of-TLAB path actually passes the sampling threshold, especially if
we let users configure the sampling threshold.  So how would we know
whether to sample it?

You also have to keep track of the sampling interval in the code where we
allocate new TLABs, in case the sampling threshold is larger than the TLAB
size.  That's not a big deal, of course.
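
As a hedged sketch of that bookkeeping, in the same illustrative style as
the model earlier in this mail (all names invented): if sampling is driven
from the TLAB-refill code, the leftover interval just has to be carried
across refills.

  // Invented sketch: carrying the sampling interval across TLAB refills
  // when the threshold can be larger than one TLAB.
  final class TlabRefillSamplingModel {
      private static final long SAMPLING_INTERVAL = 4 * 1024 * 1024;

      private long bytesUntilSample = SAMPLING_INTERVAL;

      void onTlabRefill(long bytesAllocatedInOldTlab) {
          bytesUntilSample -= bytesAllocatedInOldTlab;
          if (bytesUntilSample <= 0) {
              takeSample();
              bytesUntilSample = SAMPLING_INTERVAL;  // arm the next sample
          }
          // Otherwise the remainder simply carries into the next TLAB.
      }

      private void takeSample() {
          // Capture a stack trace, etc.
      }
  }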

And, every time the TLAB code changes, we have to consider whether / how
those changes affect this sampling mechanism.

I guess my larger point is that there are so many little corner cases with
TLAB allocation, including whether it even happens, that basing the
sampling strategy around it seems like a cop-out.  And my belief is that
the arguments against our strategy don't really hold water, especially
given the presence of the per-thread allocation counter that no one
noticed.

Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck
Rasbold).  All we really need is to write an acceptable JEP, to adjust the
code based on the changes the community wants, and someone from Oracle
willing to say "yes".

For reference, to keep track of sampling, the delta to C2 is about 150 LOC
(much of which is newlines-because-of-formatting for methods that take a
lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86
template interpreter is about 20 LOC, and the delta for the assembler is
about 40 LOC.  It's not completely trivial, but the code hasn't changed
substantially in the 5 years since I wrote it (other than a couple of
bugfixes).

Obviously, the assembler/template interpreter changes would have to be
duplicated across platforms - we can do that for PPC and aarch64, on which
we do active development, at least.

> But, seriously, why didn’t you like my proposal? It can do anything your
> scheme can with fewer and simpler code changes. The only thing that it
> cannot do is to sample based on object count (i.e., every 100 objects)
> instead of based on object size (i.e., every 1MB of allocations). But I
> think doing sampling based on size is the right approach here (IMHO).
>
I agree that sampling based on size is the right approach.

(And your approach is definitely simpler - I don't mean to discount it.
And if that's what it takes to get this feature accepted, we'll do it, but
I'll grumble about it.)

Jeremy