RFR: 8359820: Improve handshake/safepoint timeout diagnostic messages [v3]
Thomas Stuefe
stuefe at openjdk.org
Fri Jul 18 13:36:49 UTC 2025
On Fri, 18 Jul 2025 13:22:02 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>>> * Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?
>>
>> Yes, the receiving thread is the one to receive the SIGILL. I agree that the changes introduce a degree of redundancy, but it is difficult to see by looking at the thread callstack that it was killed by the timeout mechanism of the handshake. I found it by looking at events log, see the discussion in JBS.
>>
>>
>>> * Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporary sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?
>>
>> I think I already described a possible situation: if the receiver does not report the crash within 3 seconds, then a fatal error will be reported by the calling thread. However, it may happen that any other thread receives SIGILL for any other reason within that time interval. But the "busy" thread is already in the "communicative" variable, which will not be the signal receiver in this particular case. I do not really know if this situation is just hypothetical or ever occurred in practice.
>>
>>> * Is the SIGILL completely unrelated to the safepoint? Then why print the information?
>>
>> No, it is intentionally fired by the timeout handler. Quote from mr. Shipilev, see the issue discussion: "The intent for SIGILL is to trigger the crash at the thread that blocks handshake/safepoint sync. E.g. a Java thread that is stuck on miscompiled loop without safepoint checks. Or some VM code that spins without VM transitions. See [JDK-8219584](https://bugs.openjdk.org/browse/JDK-8219584). This feature is remarkably useful in the field, used this dozens of times. So whatever we do, we need to keep printing the instructions block and hopefully a backtrace."
>
> @toxaart Thank you for laying it out to me.
>
> So SDE slowed things down, we did not reach the safepoint, the timeout mechanism fired a SIGILL to the slow thread, the slow thread was not fast enough to end the JVM, and the sending thread then executed the fatal("Handshake timeout")?
>
> Which thread won the hs-err pid writing? Was (A) the winner, and it started error handling? And you maybe saw a "Thread XXX also had an error" line from the sending thread?
>
> Or (B) did the slow thread not even get the signal, and the hs_err file you got was from the fatal("Handshake timeout") in the sending thread?
>
> (B) would be odd; a signal sent with pthread_kill had not been delivered to the target thread for three seconds :-( is signal delivery on SDE broken, or is it just really slow?
>
> In any case, I think I understand now that you try to improve the hs_err printout for case (A), right? If so, sure, that makes sense.
>
> What confused me was your printout `timed out thread XXXXX`, which suggested to me that the receiving thread could be a different one from the one that should have been interrupted. But I missed the `if (handshakeTimedOutThread == p2i(_thread))` condition right above that.
>
>> > * Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?
>>
>> Yes, the receiving thread is the one to receive the SIGILL. I agree that the changes introduce a degree of redundancy, but it is difficult to see by looking at the thread callstack that it was killed by the timeout mechanism of the handshake. I found it by looking at events log, see the discussion in JBS.
>
> No problem, its fine to make things clearer.
>
>>
>> > * Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporary sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?
>>
>> I think I already described a possible situation: if the receiver does not report the crash within 3 seconds, then a fatal error will be reported by the calling thread.
>
> To help with this case, I suggest a simple addition in handshake.cpp:
>
>
> - fatal("Handshake timeout");
> + fatal("Thread " PTR_FORMAT " has not cleared handshake op: " PTR_FORMAT ", then failed to terminate JVM",
> + p2i(target), p2i(op));
> }
>
>
> which will show a clearer message at the start...
> Not sure what is so hard to understand here @tstuefe . A thread is hit with a SIGILL and we report that now, but we don't report _why_ it was hit with the SIGILL. If there were only one reason (like it executed an illegal instruction) then it would be obvious, but we have hijacked SIGILL as a generic "something happened" signal. So the proposal here is to record the identity of the thread being sent a SIGILL due to a handshake or safepoint timeout, so that when that thread responds to the SIGILL it can see that is why it got it and report that fact. If a different thread also got a SIGILL for a different reason we don't want it reporting it was due to the timeout mechanism.
Thank you, @dholmes-ora . I already answered Anton, but I get that now.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26309#issuecomment-3089503256
More information about the hotspot-runtime-dev
mailing list