RFR: 8178287: AsyncGetCallTrace fails to traverse valid Java stacks [v3]
Andrei Pangin
apangin at openjdk.java.net
Tue Aug 24 00:28:31 UTC 2021
On Mon, 19 Jul 2021 07:07:04 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:
>> Hi Ludovic,
>>
>> Thank you for working on this long-standing bug.
>> I like the idea of the proposed solution, but unfortunately it cannot be applied as is. Since the stack walking code runs inside a signal handler, it is very limited in things it can do. In particular, it must not allocate, acquire locks, etc. In your implementation, FrameParser does allocate though.
>>
>> The issue is not just theoretical: when I ran JDK with this patch with async-profiler, I immediately got the following deadlock:
>>
>>
>> (gdb) bt
>> #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
>> #1 0x00007fa2363ca025 in __GI___pthread_mutex_lock (mutex=0x7fa235da5440 <tc_mutex>)
>> at ../nptl/pthread_mutex_lock.c:80
>> #2 0x00007fa235696cb6 in ThreadCritical::ThreadCritical() () from /usr/java/jdk-18/lib/server/libjvm.so
>> #3 0x00007fa234b6fe53 in Chunk::next_chop() () from /usr/java/jdk-18/lib/server/libjvm.so
>> #4 0x00007fa234e88523 in frame::safe_for_sender(JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
>> #5 0x00007fa234e838f2 in vframeStreamForte::forte_next() () from /usr/java/jdk-18/lib/server/libjvm.so
>> #6 0x00007fa2349fbb9b in forte_fill_call_trace_given_top(JavaThread*, ASGCT_CallTrace*, int, frame) [clone .isra.20]
>> () from /usr/java/jdk-18/lib/server/libjvm.so
>> #7 0x00007fa234e8426e in AsyncGetCallTrace () from /usr/java/jdk-18/lib/server/libjvm.so
>> #8 0x00007fa228519312 in Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int) ()
>> from /mnt/c/Users/Andrei/java/async-profiler/build/libasyncProfiler.so
>> #9 0x00007fa228519c72 in Profiler::recordSample(void*, unsigned long long, int, Event*) ()
>> from /mnt/c/Users/Andrei/java/async-profiler/build/libasyncProfiler.so
>> #10 0x00007fa2285164f8 in WallClock::signalHandler(int, siginfo_t*, void*) ()
>> from /mnt/c/Users/Andrei/java/async-profiler/build/libasyncProfiler.so
>> #11 <signal handler called>
>> #12 __pthread_mutex_unlock_usercnt (decr=1, mutex=0x7fa235da5440 <tc_mutex>) at pthread_mutex_unlock.c:41
>> #13 __GI___pthread_mutex_unlock (mutex=0x7fa235da5440 <tc_mutex>) at pthread_mutex_unlock.c:356
>> #14 0x00007fa235696d3b in ThreadCritical::~ThreadCritical() () from /usr/java/jdk-18/lib/server/libjvm.so
>> #15 0x00007fa234b6fe71 in Chunk::next_chop() () from /usr/java/jdk-18/lib/server/libjvm.so
>> #16 0x00007fa234d1ca62 in ClassFileParser::parse_method(ClassFileStream const*, bool, ConstantPool const*, AccessFlags*, JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
>> #17 0x00007fa234d1e338 in ClassFileParser::parse_methods(ClassFileStream const*, bool, AccessFlags*, bool*, bool*, JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
>> #18 0x00007fa234d22459 in ClassFileParser::parse_stream(ClassFileStream const*, JavaThread*) ()
>> from /usr/java/jdk-18/lib/server/libjvm.so
>> #19 0x00007fa234d2291c in ClassFileParser::ClassFileParser(ClassFileStream*, Symbol*, ClassLoaderData*, ClassLoadInfo const*, ClassFileParser::Publicity, JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
>> #20 0x00007fa2351febb6 in KlassFactory::create_from_stream(ClassFileStream*, Symbol*, ClassLoaderData*, ClassLoadInfo const&, JavaThread*) ()
>> from /usr/java/jdk-18/lib/server/libjvm.so
>> #21 0x00007fa235645b40 in SystemDictionary::resolve_class_from_stream(ClassFileStream*, Symbol*, Handle, ClassLoadInfo const&, JavaThread*) ()
>> from /usr/java/jdk-18/lib/server/libjvm.so
>> #22 0x00007fa2350bad0a in jvm_define_class_common(char const*, _jobject*, signed char const*, int, _jobject*, char const*, JavaThread*) [clone .constprop.299] ()
>> from /usr/java/jdk-18/lib/server/libjvm.so
>> #23 0x00007fa2350bae6d in JVM_DefineClassWithSource () from /usr/java/jdk-18/lib/server/libjvm.so
>> #24 0x00007fa236a0ee12 in Java_java_lang_ClassLoader_defineClass1 () from /usr/java/jdk-18/lib/libjava.so
>
> @apangin Thanks for pointing that out! I'm updating it right now and should be pushing an update very soon. I'll also add examples on how it impacts JFR.
Hi @luhenry,
I had a chance to test your patch on realistic applications, and here are my observations:
1. JVM does not crash or hang, which is already a good sign.
2. The amount of `unknown_Java` samples indeed decreased, roughly by a factor of 2.
3. Now the bad news: with this patch, AsyncGetCallTrace starts returning many orphaned (truncated) stack traces that consist of just one or a few top frames.
For example, here are the results of profiling spring-petclinic application startup.
I summarized the amount of AsyncGetCallTrace failures by type in the following table.
Click on the links in the left column to open the entire flame graphs.
| Success | Failures | unknown_Java | not_walkable_Java | unknown_not_Java | not_walkable_not_Java | Truncated
-- | -- | -- | -- | -- | -- | -- | --
[Baseline](https://cr.openjdk.java.net/~apangin/8178287/flamegraph/baseline.html) | 86.17% | 13.83% | 7.55% | 1.54% | 2.28% | 2.45% | 0.01%
[AGCT patch](https://cr.openjdk.java.net/~apangin/8178287/flamegraph/agct-patch.html) | 89.75% | 10.25% | 3.77% | 1.02% | 2.26% | - | 3.20%
[Async-profiler](https://cr.openjdk.java.net/~apangin/8178287/flamegraph/async-profiler.html) | 98.12% | 1.88% | 1.45% | 0.29% | 0.09% | - | 0.05%
[Async-profiler with patch](https://cr.openjdk.java.net/~apangin/8178287/flamegraph/async-profiler-patch.html) | 94.95% | 5.05% | 1.21% | 0.05% | 0.10% | - | 3.69%
My biggest concern is the large number (>3%) of truncated stack traces. The probem is, such stack traces are indistinguishable (by the tools) from normal stack traces, and therefore the tools will display misleading information to a user, without even knowing that something went wrong. In this sense I'd prefer to return an error code rather than an invalid stack trace.
The above table illustrates the problem in practice. With the proposed patch, async-profiler actually performs worse than without it, because async-profiler can recover from most AGCT failures on its own, but not when AGCT returns an invalid stack trace instead of an error code.
In view of the aforementioned, I'll ask not to commit this patch before the problem of truncated stack traces is fixed. Thank you.
-------------
PR: https://git.openjdk.java.net/jdk/pull/4436
More information about the serviceability-dev
mailing list