RFR: 8259383: AsyncGetCallTrace() crashes sporadically

Lutz Schmidt lucy at openjdk.java.net
Fri Jan 22 15:00:47 UTC 2021


On Wed, 20 Jan 2021 03:36:52 GMT, Andrei Pangin <apangin at openjdk.org> wrote:

>> The changes look ok to me. I think it would be good to get someone from the compiler team to verify your reasoning here.
>
> Hi Lutz,
> 
> Thank you for working on this issue.
> I'm not a formal OpenJDK Reviewer, but as an author of async-profiler project (which heavily relies on AsyncGetCallTrace), I'm more or less familiar with this code. Let me leave a few comments.
> 
> 1. The statement `if (candidate.pc() != NULL) return false;` looks a bit odd to me. When a profiler calls AsyncGetCallTrace, pc of the initial frame is obtained from a signal context, and it is never NULL. Returning false when pc != NULL basically invalidates the entire `if (fr->cb() == NULL)` branch. If this was intended, the branch could be simplified.
> 2. At the same time, returning false whenever cb == NULL, discards a fair amount of valid stack traces, e.g. when a runtime function is called from Java without switching to in_vm state. I guess, the original intention of that loop was to handle such cases, but, as you noticed, it does not currently work well because of the assertions. I'd rather prefer revising the logic of that loop then instead of dropping all those valid samples.
> 3. What I don't really understand from the bug report is why `if (fr->cb() == NULL)` branch is executed at all in this particular example. According to the stack trace from the crash dump, the top frame (before a signal handler) is `ThreadCritical::~ThreadCritical()`. This means, a thread is in_vm state, and its last_Java_frame should have been set. In this case AsyncGetCallTrace prefers to take last_Java_frame, and thus cb should be non-NULL. Another suspicious thing is that a frame below `ThreadCritical::~ThreadCritical()` is not a C frame. This cannot normally happen (the execution is in the middle of `ThreadCritical::~ThreadCritical()`, where the stack frame is consistent). These facts make me think something bad has already happened with the stack earlier, and the failed guarantee is probably just one manifestation of a bigger problem.

Spent a lot fo time and many thoughts. Finally, I found out it was all for nothing. The failure doesn't occur at the location I was sure it would. Will invest more time and more thoughts... I'll be back!

-------------

PR: https://git.openjdk.java.net/jdk/pull/2032


More information about the serviceability-dev mailing list