Low-Overhead Heap Profiling

Tue Jun 23 07:31:22 UTC 2015

Hm... I missed that JFR did this, when that happened.  I suppose I should
pay attention to JFR, but a) I had done this many years earlier, and b) the
quick skim I did of JFR when it came out indicated to me that it wasn't
really doing anything interesting we don't already do, and that it would be
painful to get it to work in our infrastructure.

Having said that, there are a number of differences between our
approaches.  I think ours covers some use cases yours doesn't.

One difference is that we don't sample at new TLAB allocation events or
outside-of-TLAB events; instead, we sample at points drawn at random from
an exponential distribution.  I suspect that our methodology is somewhat
better: because TLABs are a fixed size, the "new TLAB allocation event"
strategy is likely to see sampling bias when allocations are periodic.  For
example, if TLABs are 512K, and my app allocates, say, a predictable 128K
of objects in a repeating cycle, only the first object in that 128K will
ever get sampled.

I'm not qualified to do the math to figure out how bad it is (my statistics
are basically rusted shut from disuse), but this sort of sampling issue is
a well-known problem in selecting statistical samples.
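
To make the periodic-allocation example concrete, here is a rough Java
simulation. It is not taken from either implementation. It assumes a fixed
512K buffer and four hypothetical allocation sites that each allocate 32K in
a repeating cycle (128K per cycle), and it compares sampling the allocation
that triggers a new buffer against sampling at exponentially distributed
byte intervals. All class and method names are made up for illustration.

```java
import java.util.SplittableRandom;

/** Illustrative simulation only; sizes and names are assumptions, not VM code. */
final class SamplingBiasDemo {
    static final int SITES = 4;                      // hypothetical allocation sites
    static final long OBJECT_BYTES = 32 * 1024;      // each site allocates 32K
    static final long INTERVAL_BYTES = 512 * 1024;   // buffer size / mean sample interval

    /** Samples whichever allocation triggers a fresh fixed-size buffer. */
    static long[] fixedIntervalCounts(int allocations) {
        long[] counts = new long[SITES];
        long remaining = 0;  // bytes left in the current buffer; 0 forces a refill first
        for (int i = 0; i < allocations; i++) {
            if (remaining < OBJECT_BYTES) {
                counts[i % SITES]++;          // the "new buffer" event is the sample
                remaining = INTERVAL_BYTES;
            }
            remaining -= OBJECT_BYTES;
        }
        return counts;
    }

    /** Samples at byte intervals drawn from an exponential distribution. */
    static long[] exponentialCounts(int allocations, long seed) {
        SplittableRandom rng = new SplittableRandom(seed);
        long[] counts = new long[SITES];
        double untilSample = nextInterval(rng);
        for (int i = 0; i < allocations; i++) {
            untilSample -= OBJECT_BYTES;
            if (untilSample <= 0) {
                counts[i % SITES]++;
                untilSample = nextInterval(rng);
            }
        }
        return counts;
    }

    /** Inverse-transform draw from an exponential with mean INTERVAL_BYTES. */
    private static double nextInterval(SplittableRandom rng) {
        return -Math.log(1.0 - rng.nextDouble()) * INTERVAL_BYTES;
    }
}
```

Under these assumptions, fixedIntervalCounts attributes every sample to
site 0, while exponentialCounts spreads the samples roughly evenly across
all four sites.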

The additional work to make the right thing happen isn't too hard.  We just
had to have the various compilers/interpreters implement a counter that
decrements until you reach the next sample point.  When you take the
sample, you choose the next sample point by drawing a random value from the
distribution.  A single decrement doesn't really make allocation more
expensive.
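
A minimal Java sketch of that countdown, for illustration only. The real
change lives in the VM's allocation paths, not in Java code, and the class
name, the 512K mean, and the decision to count bytes are assumptions made
here rather than details of the patch.

```java
import java.util.SplittableRandom;

/** Hypothetical byte-countdown sampler; not the actual VM implementation. */
final class AllocationSampler {
    private final double meanBytes;        // average number of bytes between samples
    private final SplittableRandom rng;
    private long bytesUntilSample;
    private long samplesTaken;

    AllocationSampler(double meanBytes, long seed) {
        this.meanBytes = meanBytes;
        this.rng = new SplittableRandom(seed);
        this.bytesUntilSample = nextInterval();
    }

    /** Draws an exponentially distributed interval by inverse transform sampling. */
    private long nextInterval() {
        double u = rng.nextDouble();                    // uniform in [0, 1)
        return (long) (-Math.log(1.0 - u) * meanBytes) + 1;
    }

    /** Fast path is one subtraction and one compare; returns true if sampled. */
    boolean recordAllocation(long sizeBytes) {
        bytesUntilSample -= sizeBytes;
        if (bytesUntilSample > 0) {
            return false;
        }
        samplesTaken++;                    // a real VM would capture a stack trace here
        bytesUntilSample = nextInterval(); // pick the next sample point
        return true;
    }

    long samplesTaken() {
        return samplesTaken;
    }
}
```

Because the exponential distribution is memoryless, the chance that any
given byte is the sampled one does not depend on where the previous sample
fell, which is what avoids locking onto a periodic allocation pattern.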

Another difference would be that our interface allows you to query just the
live objects in the heap, which lets people pinpoint memory leaks really
quickly.  It doesn't look from the blog entry as if yours does that?
Perhaps I'm misreading?

Similarly, ours lets you see stack traces for sampled objects that have
recently been made garbage, which allows you to pinpoint allocations that
might not be necessary.  We have plans to have a similar buffer for sampled
objects that have become garbage since the beginning of time.
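
For readers who want a picture of how the live and recently-garbage views
could fit together, here is a rough Java analogue built on weak references
and a reference queue. It is only a sketch of the idea: the actual
bookkeeping is done inside the VM, and SampleRegistry, its methods, and the
bounded garbage buffer are hypothetical names and choices.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical registry of sampled objects; a sketch, not the VM's data structure. */
final class SampleRegistry {
    /** Weak reference to a sampled object plus the trace captured when it was recorded. */
    static final class Sample extends WeakReference<Object> {
        final StackTraceElement[] trace;

        Sample(Object obj, StackTraceElement[] trace, ReferenceQueue<Object> queue) {
            super(obj, queue);
            this.trace = trace;
        }
    }

    private final ReferenceQueue<Object> collected = new ReferenceQueue<>();
    private final Set<Sample> live = new HashSet<>();
    private final Deque<Sample> recentGarbage = new ArrayDeque<>();
    private final int garbageCapacity;

    SampleRegistry(int garbageCapacity) {
        if (garbageCapacity < 1) {
            throw new IllegalArgumentException("garbageCapacity must be at least 1");
        }
        this.garbageCapacity = garbageCapacity;
    }

    /** Records a sampled object without keeping it alive. */
    synchronized void record(Object obj) {
        live.add(new Sample(obj, new Throwable().getStackTrace(), collected));
    }

    /** Moves samples whose objects were collected into the bounded garbage buffer. */
    private void drain() {
        Reference<?> ref;
        while ((ref = collected.poll()) != null) {
            Sample sample = (Sample) ref;
            live.remove(sample);
            if (recentGarbage.size() == garbageCapacity) {
                recentGarbage.removeFirst();   // evict the oldest entry
            }
            recentGarbage.addLast(sample);
        }
    }

    /** Stack traces of sampled objects that are still reachable. */
    synchronized List<StackTraceElement[]> liveTraces() {
        drain();
        List<StackTraceElement[]> out = new ArrayList<>();
        for (Sample sample : live) {
            if (sample.get() != null) {
                out.add(sample.trace);
            }
        }
        return out;
    }

    /** Stack traces of sampled objects that recently became garbage. */
    synchronized List<StackTraceElement[]> recentGarbageTraces() {
        drain();
        List<StackTraceElement[]> out = new ArrayList<>();
        for (Sample sample : recentGarbage) {
            out.add(sample.trace);
        }
        return out;
    }
}
```

When a sample moves from the live set to the garbage buffer depends on when
the collector clears the weak reference, so the garbage view lags behind
actual unreachability.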

Another difference would be that people can write their own code to query
it.  I believe that JFR is pretty tied to your UI?  Or am I mistaken about
that?  We like to roll up the stats from a bunch of running JVMs, which you
have to authenticate to connect to, store them offline, and slice and dice
them in various ways.  This means that we have a bunch of bespoke code to
export this data.

And, of course, as Kirk points out, the licensing.  Does that put the
kibosh on any attempt to get Oracle buy-in for this?  Maybe we can get
buy-in to close the feature gap?

Naturally, our approach, being in-VM, also does not have false positives
for eliminated allocations.

Jeremy

On Mon, Jun 22, 2015 at 10:31 PM, Vladimir Voskresensky <
vladimir.voskresensky at oracle.com> wrote:

> Hello Jeremy,
>
> If this is sampling, not tracing, then how is it different from the
> low-overhead memory profiling provided by JFR [1]?
> JFR samples per new TLAB allocation. It provides a really good picture,
> and I haven't seen overhead of more than 2%.
> Btw, JFR also does not have false positives reported by instrumented
> approaches for the cases when JIT was able to eliminate heap allocation.
>
> Thanks,
> Vladimir.
> [1] http://hirt.se/blog/?p=381
>
>
> On 22.06.2015 11:48, Jeremy Manson wrote:
>
>> Hey folks,
>>
>> (cc'ing Aleksey and John, because I mentioned this to them at the JVMLS
>> last year, but I never followed up.)
>>
>> We have a patch at Google I've always wanted to contribute to OpenJDK,
>> but I always figured it would be unwanted.  I've recently been thinking
>> that might not be as true, though.  I thought I would ask if there is any
>> interest / if I should write a JEP / if I should just forget it.
>>
>> The basic problem is that there is no low-overhead supported way to
>> figure out where allocation hotspots are. That is, sets of stack traces
>> where lots of allocation / large allocations took place.
>>
>> What I had originally done (this was in ~2007) was use bytecode rewriting
>> to instrument allocation sites.  The instrumentation would call a Java
>> method, and the method would capture a stack trace.  To reduce overhead,
>> there was a per-thread counter that only took a stack trace once every N
>> bytes allocated, where N is a randomly chosen point in a probability
>> distribution that centered around ~512K.
>>
>> This was *way* too slow, and it didn't pick up allocations through JNI,
>> so I instrumented allocations at the VM level, and the overhead went away.
>> The sampling is always turned on in our internal VMs, and a user can just
>> query an interface for a list of sampled stack traces.  The allocated stack
>> traces are held with weak refs, so you only get live samples.
>>
>> The additional overhead for allocations amounts to a subtraction, and an
>> occasional stack trace, which is usually a very, very small amount of our
>> CPU (although I had to do some jiggering in JDK8 to fix some performance
>> regressions).
>>
>> There really isn't another good way to do this with low overhead.  I was
>> wondering how the group would feel about our contributing it?
>>
>> Thoughts?
>>
>> Jeremy
>>
>
>