Misbehaving exit status from Hotspot

Fri Jun 29 06:04:56 UTC 2018

Hi David, Charles,

I always found it good practice to repropagate SIGTERM when handled.
But no, there is no hard rule about this, apart from the suggestion
in the glibc documentation (which everyone seems to read, so it has
become standard practice among GNU tools at least).

Since one can see it either way and the difference is philosophical
("does the fact that we grabbed control to clean up before terminating
make this not a termination anymore?"), for me the question is what we
would gain by adopting the GNU behavior.

And here I can see such a change being disruptive. So, while I would
like to behave like GNU tools, I am at least skeptical that we can
simply change this, since this would be a break in compatibility.

Cheers, Thomas

On Fri, Jun 29, 2018 at 1:51 AM, David Holmes <david.holmes at oracle.com> wrote:
> On 29/06/2018 8:54 AM, Charles Oliver Nutter wrote:
>>
>> On Thu, Jun 28, 2018 at 5:34 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>>     On 29/06/2018 6:30 AM, Charles Oliver Nutter wrote:
>>
>>         But a TERM signal *did* cause the process to stop, didn't it?
>>         This is not an appropriate value for the stop signal in that
>>         case (indeed, garbage).
>>
>>
>>     "stopping" a process is not "terminating" a process. ref: SIGSTOP,
>>     SIGCONT ("job control").
>>
>>
>> Ok.
>>
>>         The process *did* terminate due to receipt of a signal. The
>>         number of the signal that caused the termination of the process
>>         was 15, TERM. I'm still not getting something in your logic, I guess.
>>
>>
>>
>>     Choosing to terminate in response to a signal you caught, is not the
>>     same thing as being terminated (by the OS) because you did not catch
>>     a signal that was sent to you.
>>
>>         Bear with me please :-) There are big gaps in the documentation
>>         of this stuff online, and if what Hotspot does is "correct" and
>>         "standard" I would like to know that and see the docs indicating
>>         such.
>>
>>
>>     I can't speak to the accuracy of the Mac OS documentation. I'm
>>     working from the POSIX specification and looking at how Linux
>>     actually behaves.
>>
>>
>> I'm honestly trying to look at every piece of doc I can find...MacOS,
>> Linux, POSIX. I guess what I'm starting to get is that the sequence of
>> events and resulting process state from handled signals is largely
>> unspecified and up to individual programs...
>
>
> Yes - handled signals are handled by programs as they see fit.
>
>>
>>     Obviously there is some leeway here as to how a process ultimately
>>     chooses to facilitate termination. After catching a signal (any
>>     signal) a process could raise a different signal that is not caught
>>     and so have the OS terminate it - instead of calling exit(). That's
>>     what that little code snippet your report links to shows - it
>>     catches TERM, does what it needs to do, then removes the handler for
>>     TERM, raises TERM against itself and so commits process suicide. It
>>     gives the illusion that it was terminated by the initial TERM
>>     signal, but it wasn't, it was terminated by the second, uncaught,
>>     TERM signal.
>>
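A minimal sketch of the catch, clean up, reset, re-raise sequence described above, written in Python only for brevity (its `os` and `signal` modules wrap the same POSIX calls). The function name `run_child_that_reraises` and the pipe handshake are illustrative assumptions, not code from the linked report or from HotSpot.

```python
import os
import signal
import time


def run_child_that_reraises():
    """Fork a child that catches SIGTERM, then re-raises it uncaught.

    Returns the raw wait status collected by the parent.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: must never return into the caller's code.
        try:
            os.close(ready_r)

            def on_term(signum, frame):
                # (Application cleanup would go here.)
                # Remove the handler, then send the same signal to ourselves.
                signal.signal(signum, signal.SIG_DFL)
                os.kill(os.getpid(), signum)

            signal.signal(signal.SIGTERM, on_term)
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                time.sleep(0.01)
        finally:
            os._exit(1)  # only reached if something unexpected happened
    # Parent process.
    os.close(ready_w)
    os.read(ready_r, 1)  # wait until the child is ready
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return status
```

For the returned status, `os.WIFSIGNALED(status)` should be true and `os.WTERMSIG(status)` should equal `signal.SIGTERM`, because the child was terminated by the second, uncaught TERM rather than by the one it handled.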
>>
>> I think that's actually what I want, though, and I'm not entirely clear
>> why Hotspot can't do that.
>>
>> If there's no handler, and you TERM an app, you get the results I've shown
>> above (the results I hope to get from TERMing Hotspot).
>>
>> If there's a handler, then you have one of two options:
>>
>> * Do whatever your handler needs to do and then re-propagate the signal to
>> the default handler to facilitate kernel-level shutdown and state
>> manipulation.
>
>
> Ok - that is where I think the difference of perspective is. It isn't a
> "default handler" that needs to be chained. You don't need to "propagate"
> signals as if they were an exception you caught and now need to rethrow.
> There's no kernel-level shutdown you need to facilitate.
>
> Signals are just a communication mechanism to inform the process about
> something that is happening - it might be a physical something (you just had
> a page fault!) or a logical something (you've been sent a request to
> terminate). But as described in Section "2.4.3 Signal Actions" here:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
>
> there are default actions associated with each signal, if a process doesn't
> provide an explicit action.
>
> "If the default action is to terminate the process abnormally, the process
> is terminated as if by a call to _exit(), except that the status made
> available to wait(), waitid(), and waitpid() indicates abnormal termination
> by the signal."
>
> So if you have no handler for TERM you will be terminated abnormally and it's
> reported via WIF* (WIFEXITED will be false, and WIFSIGNALED will be true).
> If you catch the TERM signal and later choose to do a normal termination
> then that's up to you.
>
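The two cases in the paragraph above can be compared side by side. This is a sketch under stated assumptions: `term_child` is a hypothetical helper, and the handler's exit code of 128 + signal number mirrors the convention discussed in this thread rather than any actual JVM code.

```python
import os
import signal
import time


def term_child(catch_term):
    """Fork a child, send it SIGTERM, and return the parent's wait status."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: must never return into the caller's code.
        try:
            os.close(ready_r)
            if catch_term:
                # Caught signal: the process itself chooses a normal exit.
                signal.signal(signal.SIGTERM,
                              lambda signum, frame: os._exit(128 + signum))
            else:
                # No handler: the default action terminates the process.
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.write(ready_w, b"x")  # tell the parent we are ready
            while True:
                time.sleep(0.01)
        finally:
            os._exit(1)  # only reached if something unexpected happened
    # Parent process.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return status
```

With `catch_term=False` the status should satisfy `os.WIFSIGNALED`; with `catch_term=True` it should satisfy `os.WIFEXITED`, and `os.WEXITSTATUS` should report the code the handler chose.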
> To be clear, I think the libc advice on this topic is just wrong-headed:
>
> https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
>
> "Such a handler should end by specifying the default action for the signal
> that happened and then reraising it; this will cause the program to
> terminate with that signal, as if it had not had a handler."
>
> It's not even self-consistent because it states this for the "termination
> signals" but when caught some of these signals intentionally do not trigger
> termination. So following that advice for SIGQUIT would be completely wrong
> for the JVM!
>
>> * Do whatever your handler needs to do and NOT re-propagate the signal to
>> the default handler. Encode your own app-specific exit code.
>>
>> Why couldn't Hotspot:
>>
>> * Save the fact that termination was initiated by a TERM signal.
>> * Do all appropriate shutdown for the JVM.
>> * Prepare to exit process, see that shutdown was initiated by TERM, and
>> re-propagate TERM to the system handler rather than doing the "clean"
>> exit().
>>
>> Command-line exit code would be right, WEXITSTATUS would indicate that
>> this was signal-initiated shutdown of the process,
>
>
> WEXITSTATUS only has meaning if the process was _not_ terminated by a
> signal. If you re-raise (I won't say propagate) the signal then you won't be
> WIFEXITED but WIFSIGNALED.
>
>> and the remaining macros would indicate why. Everyone's happy!
>
>
> Umm no, everyone is not happy! The appearance of an abnormal termination
> would suggest there was no orderly shutdown of the VM, when there was. An
> abnormal termination suggests to a sysadmin that something has gone badly
> wrong, when in this case it hasn't.
>
> The way to look at this from a waitpid() perspective is this:
>
> - first check if WIFEXITED
> - if true then the JVM terminated of its own choice, so now check
> WEXITSTATUS to see why it chose to terminate
>
> The 143 (being "signal number" + 128) reflects that it was responding to a
> shutdown request made via a TERM signal.
>
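The checking order described above could look roughly like this from the parent's side. Both `wait_status_of` and `describe_status` are hypothetical names, and treating any exit code above 128 as "128 + signal number" is only the convention mentioned in this thread; a program is free to use such codes for other purposes.

```python
import os
import signal


def wait_status_of(action):
    """Fork, run `action` in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            os._exit(1)  # fallback if `action` returns or raises
    _, status = os.waitpid(pid, 0)
    return status


def describe_status(status):
    """Decode a wait status, checking WIFEXITED first."""
    if os.WIFEXITED(status):
        # WEXITSTATUS is only meaningful in this branch.
        code = os.WEXITSTATUS(status)
        # Convention only: codes above 128 are read as 128 + signal number.
        hinted_signal = code - 128 if code > 128 else None
        return ("exited", code, hinted_signal)
    if os.WIFSIGNALED(status):
        # WTERMSIG is only meaningful in this branch.
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)
```

A child that calls `os._exit(143)` would decode as a normal exit hinting at signal 15, while a child killed by an uncaught TERM would decode as signaled, which is the distinction the thread is about.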
>> This also explains the difference between CRuby, which I'm trying to
>> (a.k.a. forced to) emulate. CRuby *does* repropagate the signal after doing
>> its VM cleanup, and so the macros reflect a shutdown initiated by an
>> external signal.
>
>
> That's their choice but I don't agree with it in general.
>
>>     As I said there is no standard that says what happens for caught
>>     signals, only what happens for uncaught signals - and uncaught
>>     signals are all that counts for waitpid and WI* macros on exit
>>     status. And you have to recognize that the macros are only valid
>>     under specific conditions - no point reading WEXITSTATUS if the
>>     process was WIFSIGNALED; no point reading WTERMSIG if the process
>>     was WIFEXITED.
>>
>>     It is up to an individual program to decide both what communication
>>     mechanisms can lead to its termination (signals, poison-packets, GUI
>>     events) and what exit status it reports in response to them. For the
>>     JVM the exit status reflects how the shutdown request was
>>     originated, so for signals it uses the "standard" convention of
>>     "signal number" + 128. This is useful and necessary information for
>>     a sysadmin.
>>
>>
>> The ultimate answer may be "it's legal to do this, and Hotspot does it, so
>> Hotspot will continue to do it" but I'd really like to understand *why* the
>> signal doesn't get repropagated. It seems to me there's no reason not to,
>> since the command-line exit code would still be the same. It would just make
>> the other macros reflect the fact that the process shut down in response to
>> a signal (and yes, I know it's not an *unhandled* signal...but why can't we
>> make those macros work anyway?)
>
>
> To me you're trying to distort the meaning of those macros. They exist to
> distinguish between normal and abnormal terminations of a process, where by
> definition uncaught signals (on a case by case basis) lead to abnormal
> termination, as I quoted from POSIX.
>
>> Most examples I've seen (of TERM-hooking shutdown) *do* repropagate,
>> presumably for the reasons I've raised in this thread.
>
>
> I can't speak to other programs of an equivalent nature to the JVM. I'd be
> interested to know what other language runtimes like Self and Smalltalk did
> in this area. Or how a 30 year old UNIX programming textbook instructs you
> in this area. :)
>
> But the JVM is not doing anything wrong and after 20+ years it's not going
> to change without a very good reason.
>
>> Thanks for your patience!
>
>
> Likewise.
>
> Cheers,
> David
>
>> - Charlie