RFR: 8313796: AsyncGetCallTrace crash on unreadable interpreter method pointer [v4]

Tue Aug 8 08:49:41 UTC 2023

On Tue, 8 Aug 2023 06:22:23 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> OK let's go with your suggestion, thanks for explaining. I'm actually skeptical this can actually be a non-null bad pointer, as we've only seen this crash happen once, and the pointer was null in that instance. But this solution looks robust, so thanks for suggesting it.
>
> @richardstartin About async-safety: all supported (jit) architectures use static assembly as @theRealAph pointed out, these should be signal safe. The code snippet you found is only used by zero. You are probably not concerned with zero. And even there, yes, we longjmp out of signal handling, since there is no other way to implement SafeFetch in zero. That is technically async-sig-unsafe, but in practice it works and is tested for use in signal handlers.
> 
> About safety, @fisk is right in that this is still not completely safe since Method (and any of the objects chained to it that AGCT implicitly relies on being there, e.g. ConstMethod) can get out of scope while AGCT uses them.

@tstuefe @fisk I hadn't appreciated that the cause was probably concurrent method unloading. We don't have a core dump, just the backtrace from the crash and the disassembly from objdump, so all I knew was that the pointer was null, but not why. This is not the sort of thing that reproduces readily. I don't have as much context about the adjacent JVM mechanisms as others in this thread, and I'm just trying to fix a crash based on the evidence I have.

The null pointer seems to be a symptom rather than a cause, and it doesn't appear there's anything we can do about concurrent method unloading interfering with AsyncGetCallTrace, so I wonder how worthwhile it is to attempt a fix. On the one hand, it will still sometimes crash in another way. On the other hand, a check significantly reduces the probability: the window in which unloading a method can crash AsyncGetCallTrace shrinks to the subsequent uses of the pointer, whereas currently it spans the whole unwind preceding the current frame. Let me know what you think about proceeding, and I'll submit a fix with the null check, which would have been sufficient to avoid the observed segfault. Thanks.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/15178#discussion_r1286808560