RFR: 8359820: Improve handshake/safepoint timeout diagnostic messages [v3]

Fri Jul 18 13:24:53 UTC 2025

On Fri, 18 Jul 2025 10:46:45 GMT, Anton Artemov <duke at openjdk.org> wrote:

>>> > I have nothing against making this clearer, just saying that the info is already kind of there.
>>> 
>>> Yes, but there are a few reasons a SIGILL may have been fired at a thread, so it would be useful to get a clear indication of exactly why it happened. Not everyone will realize/know that SIGILL happens because of a timeout in handshake/safepoint.
>> 
>> I still don't get it.
>> 
>> - Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?
>> - Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporary sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?
>> - Is the SIGILL completely unrelated to the safepoint? Then why print the information?
>
>> * Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?
> 
> Yes, the receiving thread is the one to receive the SIGILL. I agree that the changes introduce a degree of redundancy, but it is difficult to see by looking at the thread callstack that it was killed by the timeout mechanism of the handshake. I found it by looking at events log, see the discussion in JBS. 
> 
> 
>> * Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporary sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?
> 
> I think I already described a possible situation: if the receiver does not report the crash within 3 seconds, then a fatal error will be reported by the calling thread. However, it may happen that any other thread receives SIGILL for any other reason within that time interval. But the "busy" thread is already in the "communicative" variable, which will not be the signal receiver in this particular case. I do not really know if this situation is just hypothetical or ever occurred in practice.
> 
>> * Is the SIGILL completely unrelated to the safepoint? Then why print the information?
> 
> No, it is intentionally fired by the timeout handler. Quote from mr. Shipilev, see the issue discussion: "The intent for SIGILL is to trigger the crash at the thread that blocks handshake/safepoint sync. E.g. a Java thread that is stuck on miscompiled loop without safepoint checks. Or some VM code that spins without VM transitions. See [JDK-8219584](https://bugs.openjdk.org/browse/JDK-8219584). This feature is remarkably useful in the field, used this dozens of times. So whatever we do, we need to keep printing the instructions block and hopefully a backtrace."

@toxaart Thank you for laying it out to me.

So SDE slowed things down, we did not reach the safepoint, the timeout mechanism fired a SIGILL to the slow thread, the slow thread was not fast enough to end the JVM, and the sending thread then executed the fatal("Handshake timeout")? 

Which thread won the hs-err pid writing?  Was (A) the winner, and it started error handling? And you maybe saw a "Thread XXX also had an error" line from the sending thread? 

Or (B) did the slow thread not even get the signal, and the hs_err file you got was from the fatal("Handshake timeout") in the sending thread?

(B) would be odd; a signal sent with pthread_kill had not been delivered to the target thread for three seconds :-( is signal delivery on SDE broken, or is it just really slow?

In any case, I think I understand now that you try to improve the hs_err printout for case (A), right? If so, sure, that makes sense.

What confused me was your printout `timed out thread XXXXX`, which suggested to me that the receiving thread could be a different one from the one that should have been interrupted. But I missed the `if (handshakeTimedOutThread == p2i(_thread))` condition right above that.

> > * Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?
> 
> Yes, the receiving thread is the one to receive the SIGILL. I agree that the changes introduce a degree of redundancy, but it is difficult to see by looking at the thread callstack that it was killed by the timeout mechanism of the handshake. I found it by looking at events log, see the discussion in JBS.

No problem, its fine to make things clearer.

> 
> > * Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporary sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?
> 
> I think I already described a possible situation: if the receiver does not report the crash within 3 seconds, then a fatal error will be reported by the calling thread.

To help with this case, I suggest a simple addition in handshake.cpp:

-  fatal("Handshake timeout");
+  fatal("Thread " PTR_FORMAT " has not cleared handshake op: " PTR_FORMAT ", then failed to terminate JVM",
+        p2i(target), p2i(op));
 }

which will show a clearer message at the start of the hs-err file, in case we don't have the VM output.

> However, it may happen that any other thread receives SIGILL for any other reason within that time interval. But the "busy" thread is already in the "communicative" variable, which will not be the signal receiver in this particular case. I do not really know if this situation is just hypothetical or ever occurred in practice.

Yes, it could happen. The mechanism could be improved by storing the fact that a SIGILL has been sent to thread X not in a global variable but in the `Thread` structure of X. Then, in VMError, one checks if the current thread had been the target of a recent pthread_kill, and only write "sent by xxx" in that case. I ignore here the possible case of multiple senders one receiver, because I think that is extremely unlikely.

Not saying you have to do this. Can also be done in a later RFE.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26309#issuecomment-3089471611