RFR: 8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP

Wed Feb 26 09:31:53 UTC 2025

On Wed, 26 Feb 2025 09:20:49 GMT, Richard Reingruber <rrich at openjdk.org> wrote:

>>> > Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)
>>> 
>>> I guess such problems can happen on all platforms which use some kind of link register (aarch64, s390, ?).
>> 
>> The actual issue here is that an attempt to walk native stack frames fails and we don't recognize that the stack is not walkable for our stackwalking code. The concrete problem is (likely) that the caller pc was not yet stored to the stack. This specific problem cannot occur on x86 (caller pc is passed on the stack), but pushing a new frame isn't atomic there either, and I'm sure there are states where our stackwalking code can crash.
>> 
>>> I also don't like that we lose so many samples with this current solution. I only approved it because I think it is better than crashing.
>>> Recognizing that a signal handler is on stack may be a better solution.
>> 
>> This would avoid this specific type of crash.
>> Attempts to walk native frames until the top java frame is found can fail, though, in similar ways.
>> That's what I meant referring to ffi calls in the pr description.
>> 
>>> Do we already have functionality for that? There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?
>> 
>> With that enhancement we would capture the top java frame (sp, pc) in the signal handler too and then do the stack walk at the safepoint. Finding the top java frame is the purpose of [find_initial_Java_frame](https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/hotspot/share/prims/forte.cpp#L271) but it crashes and would also crash with the walk of java frames delayed to the next safepoint.
>> It would only help if we used the java frame (sp, pc) we find on top at the safepoint, but doing so loses precision: e.g., if you were in a critical ffi call when the thread was interrupted, you would lose this information.
>
>> I also don't like that we lose so many samples with this current solution.
> 
> That worries me too (see pr descr.).
> It might be possible to handle this situation better in `frame::safe_for_sender` if we only make sure that the sender pc is not null.
> I was worried about the case where sender pc is random but within the code cache. This seems to be handled though in `find_initial_Java_frame`.

@reinrich @TheRealMDoerr Thank you for the explanations.

> Recognizing that a signal handler is on stack may be a better solution.

I think the SIGTRAP handler should block SIGPROF or SIGVTALRM (whatever 26 is on linux ppc). This should be possible since SIGPROF is asynchronous, so its delivery can be deferred until the handler returns.

And if we enter the JVM's SIGTRAP handler via the normal path (the JVM gets SIGTRAP directly), this is already done: all signals that are not synchronous error signals are blocked, which should include SIGPROF. However, if we enter the signal handling via chaining (in this case, via async_profiler::trap_handler), nothing is blocked. At least I don't see any setup for it.

The simple solution could be to just block SIGPROF for the current thread when entering the JVM signal handler. A better fix would be for async profiler to block SIGPROF in its trap handler (when setting up the sigaction).
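
To illustrate the second option, here is a minimal sketch of a SIGTRAP handler installed with SIGPROF (and SIGVTALRM) in its `sa_mask`, so the kernel defers the profiling signal while the handler runs. It is not the async-profiler or HotSpot code; `trap_handler`, `install_trap_handler_blocking_sigprof` and `sigprof_blocked_in_handler` are hypothetical names used only for this example, and the handler body merely records whether SIGPROF was blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler: 1 if SIGPROF was blocked while it ran, 0 if not.
 * Stays -1 until the handler has run. (Hypothetical name.) */
static volatile sig_atomic_t sigprof_blocked_in_handler = -1;

/* Hypothetical stand-in for a profiler's SIGTRAP handler that would
 * normally chain to the JVM's handler. */
static void trap_handler(int sig, siginfo_t *info, void *ucontext) {
    (void)sig; (void)info; (void)ucontext;
    sigset_t current;
    /* Passing NULL as the new set only queries the current mask.
     * sigprocmask is fine for this single-threaded demo; a threaded
     * program would use pthread_sigmask instead. */
    if (sigprocmask(SIG_SETMASK, NULL, &current) == 0) {
        sigprof_blocked_in_handler = sigismember(&current, SIGPROF);
    }
}

/* Install the handler so that SIGPROF and SIGVTALRM are blocked for the
 * duration of the handler. Returns 0 on success, -1 on failure. */
static int install_trap_handler_blocking_sigprof(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = trap_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGPROF);    /* deferred while handler runs */
    sigaddset(&sa.sa_mask, SIGVTALRM);  /* in case the profiler uses it */
    return sigaction(SIGTRAP, &sa, NULL);
}
```

After `install_trap_handler_blocking_sigprof()` succeeds, a `raise(SIGTRAP)` runs the handler with both signals blocked, and the previous mask is restored automatically when the handler returns. The mask applies from handler entry, so there is no window in which SIGPROF could arrive before an explicit blocking call inside the handler.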

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23641#issuecomment-2684407552