A hotspot patch for stack profiling (frame pointer)
Staffan Larsen
staffan.larsen at oracle.com
Fri Dec 5 19:34:43 UTC 2014
Just to note that the implementation of “jstack -F” is not at all suitable for profiling, since it has a very high overhead (it attaches a debugger to the process).
/Staffan
> On 5 dec 2014, at 20:22, Volker Simonis <volker.simonis at gmail.com> wrote:
>
> Hi Brendan,
>
> I'm still not understanding who is taking the actual stack traces (let
> alone the symbols) in your examples. Is this done by 'perf' itself
> based only on the frame pointer?
>
> As I wrote before, this is pretty hard to get right for a JVM, but
> there are good approximations. Have you looked at the 'jstack' tool
> which is part of the JDK? If you run it on a Java process, it will
> give you exact stack traces with full inlining information. However,
> this only works at safepoints, so it is probably not suitable for
> profiling with performance counters. But you can also use 'jstack -F
> -m', which gives you a 'best effort' mixed Java/C++ stack trace (most
> of the time even with inlined Java frames). This is probably the best
> you can get when interrupting a running JVM at an arbitrary point in
> time. As you mentioned in one of your blogs, the VM can be in the
> C library or even in the kernel at that time, and those don't preserve
> the frame pointer either. So it will already be hard even to walk up to
> the first Java frame.
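To make the frame-pointer question concrete, here is a minimal sketch of how a profiler can walk a stack by following saved frame pointers. It is a simplified model, not perf's actual code: it assumes the x86-64 frame layout produced by a `push rbp; mov rbp, rsp` prologue (the caller's RBP saved at [rbp], the return address at [rbp+8]), and it stands in for real stack memory with a hypothetical dict of 8-byte words. When compiled code uses RBP as a general-purpose register, as HotSpot's JIT does unless RBP is removed from the register pools, the chain breaks and the walk stops early.

```python
from typing import Dict, List

WORD = 8  # pointer size on x86-64, in bytes


def walk_frame_pointers(ip: int, rbp: int, memory: Dict[int, int],
                        max_depth: int = 64) -> List[int]:
    """Return the sampled instruction pointers, leaf first.

    `memory` is a hypothetical stand-in for the target's stack: it maps
    8-byte-aligned addresses to the 8-byte values stored there.
    """
    pcs = [ip]
    fp = rbp
    while fp and len(pcs) < max_depth:
        # A frame record is two words: saved caller RBP, then return address.
        if fp not in memory or fp + WORD not in memory:
            break  # RBP does not point at a readable frame record
        saved_fp = memory[fp]
        pcs.append(memory[fp + WORD])
        # Callers live at higher addresses (the stack grows down); anything
        # else means this was not a real frame pointer, so stop here.
        if saved_fp <= fp:
            break
        fp = saved_fp
    return pcs


# Three well-formed frames; the outermost one saved RBP == 0.
example_stack = {
    0x7000: 0x7040, 0x7008: 0x401234,
    0x7040: 0x7080, 0x7048: 0x402000,
    0x7080: 0x0, 0x7088: 0x403000,
}

full = walk_frame_pointers(0x400100, 0x7000, example_stack)
# RBP reused as a scratch register (here it holds 42): the walk stops at once.
truncated = walk_frame_pointers(0x400100, 42, example_stack)
print([hex(pc) for pc in full])
print([hex(pc) for pc in truncated])
```

The second call shows the failure mode discussed above: with no valid frame record, only the leaf address is recovered and everything above it is lost.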
>
> But nevertheless, if the output of 'jstack -F -m' is "good enough" for
> your purpose, you can implement something similar in 'perf' or a
> helper library of 'perf' and be happy (I don't actually know how perf
> takes stack traces, but I suppose there may be some kind of callback
> mechanism for walking unknown frames). This is actually not so hard.
> I've recently implemented a "print_native_stack()" function within
> hotspot itself (you can call it for example from gdb during debugging
> - see http://hg.openjdk.java.net/jdk9/dev/hotspot/rev/86183a940db4).
> Maybe you could call this function directly from 'perf' if perf
> attaches with ptrace to the process (I assume it does, or how else
> could it walk the stack)?
>
> These were just some random thoughts with the hope that they may be helpful.
>
> Regards,
> Volker
>
> PS: by the way - the flame graphs look really impressive and it would
> be really nice to have something like this for Java.
>
>
> On Thu, Dec 4, 2014 at 11:55 PM, Brendan Gregg
> <brendan.d.gregg at gmail.com> wrote:
>> G'Day,
>>
>> I've hacked hotspot to return the frame pointer, in part to see what this
>> involves, and also to have a working prototype for analysis. Along with an
>> agent to resolve symbols, this has allowed full stack profiling using Linux
>> perf_events. The following flame graphs show the resulting profiles.
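The symbol-resolving agent works because perf looks up JIT-compiled addresses in a /tmp/perf-<pid>.map text file, one `START SIZE symbolname` entry per line with START and SIZE written in hex. A minimal sketch of that lookup follows; the `PerfMap` class and the sample symbols are hypothetical, and it assumes that entries do not overlap.

```python
import bisect
from typing import List, Optional, Tuple


class PerfMap:
    """Address-to-symbol lookup over perf map text (hypothetical helper)."""

    def __init__(self, text: str) -> None:
        entries: List[Tuple[int, int, str]] = []
        for line in text.splitlines():
            # "START SIZE symbolname": the symbol may itself contain spaces.
            parts = line.strip().split(" ", 2)
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            start, size = int(parts[0], 16), int(parts[1], 16)
            entries.append((start, size, parts[2]))
        entries.sort()
        self._entries = entries
        self._starts = [start for start, _, _ in entries]

    def resolve(self, addr: int) -> Optional[str]:
        # Find the last entry starting at or below addr, then check its size.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        start, size, symbol = self._entries[i]
        return symbol if addr < start + size else None


SAMPLE_MAP = """\
7f3a1c000100 40 com/example/Server::accept
7f3a1c001000 2a0 com/example/Server::handle
7f3a1c0013a0 80 com/example/Codec::decode
"""

perf_map = PerfMap(SAMPLE_MAP)
print(perf_map.resolve(0x7F3A1C001010))  # inside Server::handle
print(perf_map.resolve(0x7F3A1C0012A0))  # gap between handle and decode
```

Sorting once and using binary search keeps each lookup cheap, which matters when resolving every frame of every sample.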
>>
>> A mixed mode CPU flame graph of a vert.x benchmark (click to zoom):
>>
>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-vertx.svg
>>
>> Same thing, but this time disabling inlining, to show more frames:
>>
>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-flamegraph.svg
>>
>> As expected, performance is worse without inlining. You can compare the
>> flame graphs side by side to see why. Less time spent doing work / I/O!
>>
>> https://github.com/brendangregg/Misc/blob/master/java/openjdk8_b132-fp.diff
>> is my patch, and currently only works for x86-64. It removes RBP from the
>> register pools, and inserts "mov(rbp, rsp)" into two function prologues. It
>> is also unsupported: use at your own risk. I'm not a veteran hotspot
>> engineer, so chances I messed something up are high.
>>
>> I'd love to be able to enable frame pointers in Oracle JDK, e.g., with an
>> -XX:+NoOmitFramePointer option. Putting it behind
>> -XX:+UnlockDiagnosticVMOptions or -XX:+UnlockExperimentalVMOptions would be
>> fine, so long as we had some way to turn it on. If someone wants to include
>> (improve, rewrite) my patch, please do.
>>
>> I don't have much perf data yet, but on the vert.x microbenchmark it looked
>> like returning the frame pointer cost 2.6% performance. I hope that's
>> somewhat worst-case for production workloads. (I was also able to recover
>> the 2.6% by fine-tuning other options, so were this a production change, I'd
>> be hoping not to regress performance at all.)
>>
>> We've discussed this before
>> (http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-October/thread.html#15939).
>> The Solaris-assisted approach that Serguei Spitsyn described (JDK-6617153)
>> should work very well. The JVM can run as-is, full stacks can be generated
>> on-demand, and symbols should always be correct.
>>
>> The frame pointer approach costs a little performance, and only shows
>> partial stacks after inlining (unless you disable inlining, but that can
>> cost >40% performance). There is the other issue Volker Simonis mentioned as
>> well, where some stacks may not be profiled correctly. And, if you are
>> unlucky, symbols can move during the profile, so any static perf-map-agent
>> map will translate some incorrectly (I've considered developing a way to
>> detect this, and highlight such frames as dubious.)
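One hypothetical way to flag such frames (not necessarily what Brendan had in mind) is to dump the map at both the start and the end of the profile, and mark any sampled address that resolves to different symbols in the two snapshots. This sketch only catches moves visible in those two snapshots, and it assumes at most one entry covers each address within a snapshot; the function names and the "[dubious]" label are made up for illustration.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int, str]  # (start, end_exclusive, symbol)


def load_map(text: str) -> List[Range]:
    """Parse perf map lines of the form 'START SIZE symbol' (hex numbers)."""
    ranges: List[Range] = []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3:
            start = int(parts[0], 16)
            ranges.append((start, start + int(parts[1], 16), parts[2]))
    return ranges


def lookup(ranges: List[Range], addr: int) -> Optional[str]:
    # Linear scan for clarity; assumes no overlapping entries.
    for start, end, symbol in ranges:
        if start <= addr < end:
            return symbol
    return None


def label_frame(addr: int, before: List[Range], after: List[Range]) -> str:
    """Name a sampled address, marking it dubious if the snapshots disagree."""
    old, new = lookup(before, addr), lookup(after, addr)
    if old == new:
        return old if old is not None else "[unknown]"
    return f"{old or '[unknown]'} / {new or '[unknown]'} [dubious]"


BEFORE = load_map("7f0000001000 100 com/example/A::run\n")
AFTER = load_map(
    "7f0000001000 80 com/example/B::call\n"
    "7f0000001080 80 com/example/A::run\n"
)

print(label_frame(0x7F0000001010, BEFORE, AFTER))
print(label_frame(0x7F00000010A0, BEFORE, AFTER))
```

The first address was reused for different code between the snapshots, so it is shown with both candidate names; the second resolves consistently and is reported normally.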
>>
>> At Netflix we are mostly Java on Linux. Switching to Oracle Solaris for this
>> feature is going to be a tough sell, especially when the value of full stack
>> profiling isn't widely understood. I personally think it might be a bit
>> easier if a -XX:+NoOmitFramePointer option existed, so Linux users can try
>> the feature, then consider the better Solaris version after gaining solid
>> experience on why it is so important.
>>
>> We recently blogged about the value of stack profiling and flame graphs,
>> http://techblog.netflix.com/2014/11/nodejs-in-flames.html, although this was
>> for Node.js, which already has frame pointer support.
>>
>> If anyone wants to try generating these mixed mode CPU flame graphs
>> themselves (in a test environment!), the first step is to compile OpenJDK 8
>> b132 with the previous patch, and get that running. Also install the
>> packages for the "perf" command. The remaining steps would be something
>> like:
>>
>> # git clone --depth=1 https://github.com/brendangregg/FlameGraph
>> # git clone --depth=1 https://github.com/jrudolph/perf-map-agent
>> # cd perf-map-agent
>> # export JAVA_HOME=/...
>> # cmake .
>> # make
>> # perf record -F 99 -p `pgrep -n java` -g -- sleep 30
>> # java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
>>     net.virtualvoid.perf.AttachOnce `pgrep -n java`
>> # perf script > ../FlameGraph/out.stacks
>> # cd ../FlameGraph
>> # ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=java \
>>     > out.svg
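The stackcollapse-perf.pl step folds each sample from `perf script` into a single line of semicolon-separated frames, root first, followed by a sample count; that is the input flamegraph.pl expects. The Python sketch below approximates the folding for illustration only. It is not a port of the Perl script and handles far fewer cases: it assumes command names contain no spaces and that each frame line looks like `address symbol (dso)`. The sample input is made up.

```python
import re
from collections import Counter
from typing import List


def collapse_perf_script(text: str) -> Counter:
    """Fold `perf script` output into {"root;...;leaf": sample_count}."""
    # Samples are separated by blank lines.
    samples: List[List[str]] = []
    current: List[str] = []
    for raw in text.splitlines():
        if raw.strip():
            current.append(raw)
        elif current:
            samples.append(current)
            current = []
    if current:
        samples.append(current)

    folded: Counter = Counter()
    for header, *frame_lines in samples:
        comm = header.split()[0]  # the header starts with the command name
        frames: List[str] = []
        for line in frame_lines:
            parts = line.strip().split(None, 1)  # address, then "symbol (dso)"
            if len(parts) < 2:
                continue
            symbol = parts[1]
            if symbol.endswith(")") and " (" in symbol:
                symbol = symbol.rsplit(" (", 1)[0]  # drop the "(dso)" suffix
            symbol = re.sub(r"\+0x[0-9a-f]+$", "", symbol)  # drop "+0x..." offset
            # ';' separates frames in the folded format, so this sketch
            # replaces any ';' inside a symbol with ':'.
            frames.append(symbol.replace(";", ":"))
        # perf prints the leaf frame first; flame graphs want the root first.
        folded[";".join([comm] + frames[::-1])] += 1
    return folded


SAMPLE = """\
java 4242 [001] 100.000001: cpu-clock:
\t    7f00000010a0 com/example/Worker::run (/tmp/perf-4242.map)
\t    7f0000002000 com/example/Main::loop (/tmp/perf-4242.map)
\t    7f0000003000 start_thread (/lib/libpthread.so.0)

java 4242 [001] 100.010001: cpu-clock:
\t    7f00000010a0 com/example/Worker::run (/tmp/perf-4242.map)
\t    7f0000002000 com/example/Main::loop (/tmp/perf-4242.map)
\t    7f0000003000 start_thread (/lib/libpthread.so.0)

java 4242 [000] 100.020001: cpu-clock:
\t    ffffffff81000000 [unknown] ([kernel.kallsyms])
\t    7f0000003000 start_thread (/lib/libpthread.so.0)
"""

for stack, count in sorted(collapse_perf_script(SAMPLE).items()):
    print(stack, count)
```

Identical stacks collapse into one line with a higher count, which is what makes the flame graph's frame widths proportional to time spent.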
>>
>> Finally, if you are new to CPU flame graphs, see
>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html .
>>
>> Brendan