Misbehaving exit status from Hotspot

Thu Jun 28 23:51:33 UTC 2018

On 29/06/2018 8:54 AM, Charles Oliver Nutter wrote:
> On Thu, Jun 28, 2018 at 5:34 PM, David Holmes <david.holmes at oracle.com 
> <mailto:david.holmes at oracle.com>> wrote:
> 
>     On 29/06/2018 6:30 AM, Charles Oliver Nutter wrote:
> 
>         But a TERM signal *did* cause the process to stop, didn't it?
>         This is not an appropriate value for the stop signal in that
>         case (indeed, garbage).
> 
> 
>     "stopping" a process is not "terminating" a process. ref: SIGSTOP,
>     SIGCONT ("job control").
> 
> 
> Ok.
> 
>         The process *did* terminate due to receipt of a signal. The
>         number of the signal that caused the termination of the process
>         was 15, TERM. I'm still not getting something in your logic, I guess.
> 
> 
> 
>     Choosing to terminate in response to a signal you caught, is not the
>     same thing as being terminated (by the OS) because you did not catch
>     a signal that was sent to you.
> 
>         Bear with me please :-) There are big gaps in the documentation
>         of this stuff online, and if what Hotspot does is "correct" and
>         "standard" I would like to know that and see the docs indicating
>         such.
> 
> 
>     I can't speak to the accuracy of the Mac OS documentation. I'm
>     working from the POSIX specification and looking at how Linux
>     actually behaves.
> 
> 
> I'm honestly trying to look at every piece of doc I can find...MacOS, 
> Linux, POSIX. I guess what I'm starting to get is that the sequence of 
> events and resulting process state from handled signals is largely 
> unspecified and up to individual programs...

Yes - handled signals are handled by programs as they see fit.

> 
>     Obviously there is some leeway here as to how a process ultimately
>     chooses to facilitate termination. After catching a signal (any
>     signal) a process could raise a different signal that is not caught
>     and so have the OS terminate it - instead of calling exit(). That's
>     what that little code snippet your report links to shows - it
>     catches TERM, does what it needs to do, then removes the handler for
>     TERM, raises TERM against itself and so commits process suicide. It
>     gives the illusion that it was terminated by the initial TERM
>     signal, but it wasn't, it was terminated by the second, uncaught,
>     TERM signal.
> 
> 
> I think that's actually what I want, though, and I'm not entirely clear 
> why Hotspot can't do that.
> 
> If there's no handler, and you TERM an app, you get the results I've 
> shown above (the results I hope to get from TERMing Hotspot).
> 
> If there's a handler, then you have one of two options:
> 
> * Do whatever your handler needs to do and then re-propagate the signal 
> to the default handler to facilitate kernel-level shutdown and state 
> manipulation.

Ok - that is where I think the difference of perspective is. It isn't a 
"default handler" that needs to be chained. You don't need to 
"propagate" signals as if they were an exception you caught and now need 
to rethrow. There's no kernel-level shutdown you need to facilitate.

Signals are just a communication mechanism to inform the process about 
something that is happening - it might be a physical something (you just 
had a page fault!) or a logical something (you've been sent a request to 
terminate). But as described in Section "2.4.3 Signal Actions" here:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04

there are default actions associated with each signal, if a process 
doesn't provide an explicit action.

"If the default action is to terminate the process abnormally, the 
process is terminated as if by a call to _exit(), except that the status 
made available to wait(), waitid(), and waitpid() indicates abnormal 
termination by the signal."

So if you have no handler for TERM you will be terminated abnormally and 
it's reported via WIF* (WIFEXITED will be false, and WIFSIGNALED will be 
true). If you catch the TERM signal and later choose to do a normal 
termination then that's up to you.
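
A minimal sketch of that distinction, written in Python because its os module exposes the wait-status macros as functions of the same names. The helper name run_child and the pipe used as a readiness handshake are my own illustration, not anything the JVM does. One child keeps the default action for TERM; the other catches TERM and chooses a normal exit.

```python
import os
import signal


def run_child(install_handler):
    """Fork a child, send it SIGTERM, and return the raw wait status."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process.
        try:
            os.close(ready_r)
            # Make sure TERM is deliverable regardless of inherited state.
            signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})
            if install_handler:
                # Caught signal: the program decides what happens next.
                # Here it chooses a normal termination with status 0.
                signal.signal(signal.SIGTERM, lambda signum, frame: os._exit(0))
            else:
                # Uncaught signal: the default action terminates abnormally.
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.write(ready_w, b"x")  # tell the parent we are set up
            while True:
                signal.pause()
        finally:
            os._exit(127)  # only reached if something above failed

    # Parent process: wait until the child is set up, then send TERM.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return status
```

With no handler, os.WIFSIGNALED(status) is true and os.WTERMSIG(status) gives the signal number. With the handler, os.WIFEXITED(status) is true and os.WEXITSTATUS(status) is whatever the handler chose.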

To be clear, I think the libc advice on this topic is just wrong-headed:

https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

"Such a handler should end by specifying the default action for the 
signal that happened and then reraising it; this will cause the program 
to terminate with that signal, as if it had not had a handler."

It's not even self-consistent, because it states this for the 
"termination signals", yet some of those signals, when caught, 
intentionally do not trigger termination. So following that advice for 
SIGQUIT would be completely wrong for the JVM!

> * Do whatever your handler needs to do and NOT re-propagate the signal 
> to the default handler. Encode your own app-specific exit code.
> 
> Why couldn't Hotspot:
> 
> * Save the fact that termination was initiated by a TERM signal.
> * Do all appropriate shutdown for the JVM.
> * Prepare to exit process, see that shutdown was initiated by TERM, and 
> re-propagate TERM to the system handler rather than doing the "clean" 
> exit().
> 
> Command-line exit code would be right, WEXITSTATUS would indicate that 
> this was signal-initiated shutdown of the process,

WEXITSTATUS only has meaning if the process was _not_ terminated by a 
signal. If you re-raise (I won't say propagate) the signal then you 
won't be WIFEXITED but WIFSIGNALED.
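
For illustration, here is a rough Python sketch of the re-raise pattern under discussion, so the effect on the wait status can be observed directly. The function name term_child_that_reraises is hypothetical, and the "cleanup" step is only a comment. This is not how the JVM behaves; it is the alternative being proposed.

```python
import os
import signal


def term_child_that_reraises():
    """Fork a child that catches TERM, restores the default action and
    re-raises TERM against itself. Returns the raw wait status."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process.
        try:
            os.close(ready_r)

            def on_term(signum, frame):
                # (an orderly shutdown would run here)
                signal.signal(signum, signal.SIG_DFL)  # remove the handler
                os.kill(os.getpid(), signum)           # second, uncaught TERM

            signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})
            signal.signal(signal.SIGTERM, on_term)
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                signal.pause()
        finally:
            os._exit(127)  # only reached if something above failed

    # Parent process.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return status
```

The parent sees WIFSIGNALED with WTERMSIG equal to SIGTERM and WIFEXITED false, so there is no exit status to read, even though the child did handle the first signal.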

> and the remaining 
> macros would indicate why. Everyone's happy!

Umm no, everyone is not happy! The appearance of an abnormal termination 
would suggest there was no orderly shutdown of the VM, when there was. 
An abnormal termination suggests to a sysadmin that something has gone 
badly wrong, when in this case it hasn't.

The way to look at this from a waitpid() perspective is this:

- first check if WIFEXITED
- if true then the JVM terminated of its own choice, so now check 
WEXITSTATUS to see why it chose to terminate

The 143 (being "signal number" + 128, where TERM is signal 15) reflects 
that it was responding to a shutdown request made via a TERM signal.
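
The checking order above could be sketched like this in Python. The names describe, wait_status_of and die_by_term are made up for the example, and treating any exit status above 128 as "signal number" + 128 is only the convention mentioned here, not something POSIX guarantees for arbitrary programs.

```python
import os
import signal


def describe(status):
    """Decode a raw wait status, checking WIFEXITED before anything else."""
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)  # only meaningful when WIFEXITED
        # Convention, not a guarantee: > 128 hints at "signal number" + 128.
        hint = code - 128 if code > 128 else None
        return ("exited", code, hint)
    if os.WIFSIGNALED(status):
        # Only meaningful when WIFSIGNALED: the uncaught signal's number.
        return ("signaled", os.WTERMSIG(status), None)
    return ("other", None, None)  # stopped or continued, not terminated


def wait_status_of(child_action):
    """Run child_action in a forked child and return its raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            child_action()
        finally:
            os._exit(127)  # only reached if child_action returns or raises
    _, status = os.waitpid(pid, 0)
    return status


def die_by_term():
    """Terminate the calling process via an uncaught SIGTERM."""
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGTERM)
```

For example, describe(wait_status_of(lambda: os._exit(143))) reports a normal exit with status 143 and a hint of signal 15, while describe(wait_status_of(die_by_term)) reports termination by signal 15 with no exit status at all.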

> This also explains the difference between CRuby, which I'm trying to 
> (a.k.a. forced to) emulate. CRuby *does* repropagate the signal after 
> doing its VM cleanup, and so the macros reflect a shutdown initiated by 
> an external signal.

That's their choice but I don't agree with it in general.

>     As I said there is no standard that says what happens for caught
>     signals, only what happens for uncaught signals - and uncaught
>     signals are all that counts for waitpid and WI* macros on exit
>     status. And you have to recognize that the macros are only valid
>     under specific conditions - no point reading WEXITSTATUS if the
>     process was WIFSIGNALED; no point reading WTERMSIG if the process
>     was WIFEXITED.
> 
>     It is up to an individual program to decide both what communication
>     mechanisms can lead to its termination (signals, poison-packets, GUI
>     events) and what exit status it reports in response to them. For the
>     JVM the exit status reflects how the shutdown request was
>     originated, so for signals it uses the "standard" convention of
>     "signal number" + 128. This is useful and necessary information for
>     a sysadmin.
> 
> 
> The ultimate answer may be "it's legal to do this, and Hotspot does it, 
> so Hotspot will continue to do it" but I'd really like to understand 
> *why* the signal doesn't get repropagated. It seems to me there's no 
> reason not to, since the command-line exit code would still be the same. 
> It would just make the other macros reflect the fact that the process 
> shut down in response to a signal (and yes, I know it's not an 
> *unhandled* signal...but why can't we make those macros work anyway?) 

To me you're trying to distort the meaning of those macros. They exist 
to distinguish between normal and abnormal terminations of a process, 
where by definition uncaught signals (on a case-by-case basis) lead to 
abnormal termination, as I quoted from POSIX.

> Most examples I've seen (of TERM-hooking shutdown) *do* repropagate, 
> presumably for the reasons I've raised in this thread.

I can't speak to other programs of an equivalent nature to the JVM. I'd 
be interested to know what other language runtimes like Self and 
Smalltalk did in this area. Or how a 30 year old UNIX programming 
textbook instructs you in this area. :)

But the JVM is not doing anything wrong and after 20+ years it's not 
going to change without a very good reason.

> Thanks for your patience!

Likewise.

Cheers,
David

> - Charlie