A hotspot patch for stack profiling (frame pointer)

Mon Dec 8 19:17:42 UTC 2014

> On 8 dec 2014, at 16:05, Maynard Johnson <maynardj at us.ibm.com> wrote:
> 
> On 12/05/2014 05:09 PM, Brendan Gregg wrote:
>> G'Day Volker,
>> 
>> On Fri, Dec 5, 2014 at 11:22 AM, Volker Simonis
>> <volker.simonis at gmail.com> wrote:
>>> Hi Brendan,
>>> 
>>> I'm still not understanding who is taking the actual stack traces (let
>>> alone the symbols) in your examples. Is this done by 'perf' itself
>>> based only on the frame pointer?
>> 
>> perf is walking the frame pointers.
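
As a rough illustration of what "walking the frame pointers" means, here is a minimal sketch over simulated stack memory. It assumes the usual x86-64 layout when frame pointers are preserved: the word at the frame pointer holds the caller's saved frame pointer, and the next word holds the return address. The addresses and the `walk_frame_pointers` helper are invented for this example and are not perf code.

```python
WORD = 8  # bytes per stack slot on x86-64

def walk_frame_pointers(memory, fp, max_depth=64):
    """Collect return addresses by following the chain of saved frame pointers.

    memory maps an address to the word stored there (simulated stack).
    """
    return_addresses = []
    for _ in range(max_depth):
        if fp == 0 or fp not in memory or fp + WORD not in memory:
            break
        return_addresses.append(memory[fp + WORD])
        next_fp = memory[fp]
        # The stack grows down, so a caller's frame must sit at a higher
        # address. Anything else means the chain is finished or corrupted.
        if next_fp <= fp:
            break
        fp = next_fp
    return return_addresses

# Three made-up frames: [fp] = saved caller fp, [fp + 8] = return address.
memory = {
    0x7000: 0x7020, 0x7008: 0x401111,
    0x7020: 0x7040, 0x7028: 0x402222,
    0x7040: 0x0,    0x7048: 0x403333,
}
frames = walk_frame_pointers(memory, 0x7000)
print([hex(a) for a in frames])  # ['0x401111', '0x402222', '0x403333']

# If code reuses the frame pointer register as a general-purpose register,
# the saved value is garbage and the walk stops early.
broken = dict(memory)
broken[0x7000] = 0x1234
truncated = walk_frame_pointers(broken, 0x7000)
print([hex(a) for a in truncated])  # ['0x401111']
```

The second call shows why stacks look broken when compiled code does not preserve the frame pointer: the walker has no way to find the next frame.
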
> Volker, to be specific, the perf profiling tool has a user space part and a
> kernel space part. The collection of stack traces is done by the kernel.
> When a user-specified event (or series of events) occurs, the process
> being profiled is interrupted and the sampled information (which can
> optionally include a full stack trace) is made available to the user space
> perf tool to be saved to a file for future post-profiling processing.
> 
> During the profiling phase, the perf tool collects information about the
> profiled process's memory mappings, which allows for this address-to-symbol
> resolution. It's in the post-profiling phase where the sampled instruction,
> along with its associated stack trace, is resolved to the appropriate symbol
> (i.e., function/method) in a specific binary file (e.g., library, executable).
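
The mapping step described above can be sketched as follows. This is only an illustration under assumptions: the `Mapping` fields (start, end, file offset, path) mirror the kind of information a memory mapping carries, and the addresses and paths are invented. It is not perf's implementation.

```python
from bisect import bisect_right
from typing import NamedTuple

class Mapping(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address
    pgoff: int   # file offset that corresponds to 'start'
    path: str    # backing file

def find_mapping(mappings, addr):
    """Return the mapping that contains addr, or None.

    Assumes mappings are sorted by start and do not overlap.
    """
    starts = [m.start for m in mappings]
    i = bisect_right(starts, addr) - 1
    if i >= 0 and addr < mappings[i].end:
        return mappings[i]
    return None

def to_file_offset(mapping, addr):
    """Translate a sampled address into an offset within the backing file."""
    return addr - mapping.start + mapping.pgoff

# Hypothetical mappings recorded during the profiling phase.
mappings = [
    Mapping(0x400000, 0x401000, 0x0, "/usr/bin/java"),
    Mapping(0x7F0000000000, 0x7F0000200000, 0x1000, "/lib/libjvm.so"),
]

sample = 0x7F0000000123
m = find_mapping(mappings, sample)
print(m.path, hex(to_file_offset(m, sample)))
```

Once the file and offset are known, the symbol table of that binary can be searched for the function covering the offset. Samples that fall in anonymous executable memory (JITed code) have no backing file, which is where the map file comes in.
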
> 
> And if the VM creates a /tmp/perf-<PID>.map file to save information about
> JITed methods, perf's post-profiling tool will find it and use it to
> correlate sampled addresses it collected from the VM's executable anonymous
> memory mappings to the method names.
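
A minimal sketch of how such a map file can be consumed is shown below. It assumes each line holds a start address and a size in hexadecimal followed by the symbol name. The sample entries and the `parse_perf_map` and `resolve` helpers are made up for illustration and are not perf's code.

```python
from bisect import bisect_right

def parse_perf_map(text):
    """Parse 'START SIZE name' lines (START and SIZE in hex) into sorted tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # the name may itself contain spaces
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries, addr):
    """Return the name of the entry covering addr, or None."""
    starts = [e[0] for e in entries]
    i = bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return name
    return None

# Invented contents of a /tmp/perf-<PID>.map file.
sample_map = """\
7f2a10000000 40 Interpreter
7f2a10001000 1a0 Ljava/lang/String;::hashCode
7f2a10002000 2c0 Lcom/example/Server;::handle
"""

entries = parse_perf_map(sample_map)
print(resolve(entries, 0x7F2A10001010))  # Ljava/lang/String;::hashCode
print(resolve(entries, 0x7F2A10005000))  # None
```
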

Is there a way in this .map file to express that different JITed methods are located at the same address at different times? This typically happens a lot when classes and their JITed methods are being unloaded from the VM. That space will be reused by a different method. I’m guessing this would confuse perf.

/Staffan
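
To make the concern concrete: a map line carries only a start address, a size, and a name, so there is no field that says when an entry was valid. The sketch below shows what a resolver would need in order to tell two methods apart at the same address. The load and unload timestamps are hypothetical and are not part of the map file format; the method names and addresses are invented.

```python
def resolve_at(entries, addr, when):
    """Resolve addr to the method that occupied it at time 'when'.

    Each entry is (start, size, name, loaded, unloaded). The last two
    fields are hypothetical timestamps; unloaded is None while the
    method is still installed.
    """
    for start, size, name, loaded, unloaded in entries:
        alive = loaded <= when and (unloaded is None or when < unloaded)
        if alive and start <= addr < start + size:
            return name
    return None

# Two methods that occupied the same code cache address at different times.
entries = [
    (0x7F2A10001000, 0x100, "OldClass::run", 0.0, 5.0),
    (0x7F2A10001000, 0x180, "NewClass::compute", 6.0, None),
]

addr = 0x7F2A10001010
print(resolve_at(entries, addr, 2.0))  # OldClass::run
print(resolve_at(entries, addr, 7.0))  # NewClass::compute
```

Without the time dimension, both samples would have to resolve to a single name, and at least one of them would be attributed to the wrong method.
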

> 
> -Maynard
>> 
>> A JVMTI agent, perf-map-agent, is providing a map file for symbol
>> translation under /tmp/perf-PID.map. Linux perf already hunts for such
>> a file when doing symbol translation.
>> 
>>> 
>>> As I wrote before, this is pretty hard to get right for a JVM, but
>>> there are good approximations. Have you looked at the 'jstack' tool
>>> which is part of the JDK? If you run it on a Java process, it will
>>> give you exact stack traces with full inlining information. However
>>> this only works at safepoints so it is probably not suitable for
>>> profiling with performance counters.
>> 
>> Right, jstack works, and I get full correct stacks. I do really want
>> to take stacks at any moment: not just CPU samples, but when tracing
>> kernel TCP events, or PMC cache miss profiling, etc. perf can already
>> do many advanced tracing and profiling activities. I just needed the
>> Java stacks for context.
>> 
>>> But you can also use 'jstack -F -m', which gives you a 'best effort'
>>> mixed Java/C++ stack trace (most of the time even with inlined Java
>>> frames). This is probably the best
>>> you can get when interrupting a running JVM at an arbitrary point in
>>> time. As you mentioned in one of your blogs, the VM can be in the
>>> C-Library or even in the kernel at that time which don't preserve the
>>> frame pointer either. So it will be already hard to even walk up to
>>> the first Java frame.
>> 
>> Well, the JVMs I'm looking at are already built with
>> -fno-omit-frame-pointer (which is good). I edited hotspot to preserve
>> it as well.
>> 
>> Here's before I changed hotspot:
>> 
>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-nofp.svg
>> 
>> Yes, most stacks are clearly broken.
>> 
>> After changing hotspot:
>> 
>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-vertx.svg
>> 
>> It's looking pretty good. If you look carefully on the far left and
>> right, there are 0.8% stacks in read() and write() directly from java,
>> which may well be broken (unless a java thread is calling these
>> directly; there could also be some gcc inlining going on). Even if
>> they are broken, I can see 98% of my profile. Plus, I'd be interested
>> to know what exactly is reusing the frame pointer, so we could fix
>> that too.
>> 
>> The Java stacks themselves are also about a third as deep as they
>> should be, due to inlining.
>> 
>>> 
>>> But nevertheless, if the output of 'jstack -F -m' is "good enough" for
>>> your purpose, you can implement something similar in 'perf' or a
>>> helper library of 'perf' and be happy (I don't actually know how perf
>>> takes stack traces but I suppose there may be some kind of callback
>>> mechanism for walking unknown frames). This is actually not so hard.
>>> I've recently implemented a "print_native_stack()" function within
>>> hotspot itself (you can call it for example from gdb during debugging
>>> - see http://hg.openjdk.java.net/jdk9/dev/hotspot/rev/86183a940db4).
>>> Maybe you could call this function directly from 'perf' if perf
>>> attaches with ptrace to the process (I assume it does or how else
>>> could it walk the stack)?
>> 
>> An OS-cooperative stack walker would be great, and I think the hotspot
>> team is already doing this for Oracle Solaris. Thanks for the code
>> too, this is pretty interesting.
>> 
>> jstack -F -m eats 0.5s of CPU for me, so it would need work to make
>> this into a 99 Hertz-capable profiler. Plus I'd like to pick arbitrary
>> kernel functions or tracepoints and get Java context from them, too.
>> E.g., TCP functions, memory allocation, disk I/O, etc.
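
The gap between the quoted cost and a 99 Hertz sampling rate can be worked out with a little arithmetic. The sketch assumes, purely for illustration, that the 0.5 s figure would be paid on every sample.

```python
sample_rate_hz = 99        # target sampling frequency
cost_per_stack_s = 0.5     # CPU time quoted above for one 'jstack -F -m' run

# Time available between two consecutive samples.
interval_s = 1 / sample_rate_hz

# CPU time the stack collection alone would need per second of profiling.
cpu_seconds_per_second = sample_rate_hz * cost_per_stack_s

# How many times the per-sample cost exceeds the sampling interval.
overrun_factor = cost_per_stack_s / interval_s

print(f"interval between samples: {interval_s * 1000:.1f} ms")
print(f"CPU needed per second of profiling: {cpu_seconds_per_second:.1f} s")
print(f"cost exceeds interval by a factor of {overrun_factor:.1f}")
```

Under that assumption each sample has roughly 10.1 ms available, while collecting the stack would take about 49.5 times longer than that, so the collection cost would need to drop by well over an order of magnitude to keep up.
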
>> 
>>> 
>>> These were just some random thoughts with the hope that they may be helpful.
>> 
>> Yes, thanks!
>> 
>> Brendan
>> 
>