Misbehaving exit status from Hotspot

Thu Jun 28 22:54:07 UTC 2018

On Thu, Jun 28, 2018 at 5:34 PM, David Holmes <david.holmes at oracle.com>
wrote:

> On 29/06/2018 6:30 AM, Charles Oliver Nutter wrote:
>
>> But a TERM signal *did* cause the process to stop, didn't it? This is not
>> an appropriate value for the stop signal in that case (indeed, garbage).
>>
>
> "stopping" a process is not "terminating" a process. ref: SIGSTOP, SIGCONT
> ("job control").
>

Ok.

>> The process *did* terminate due to receipt of a signal. The number of the
>> signal that caused the termination of the process was 15, TERM. I'm still
>> not getting something in your logic, I guess.
>
>
>
> Choosing to terminate in response to a signal you caught, is not the same
> thing as being terminated (by the OS) because you did not catch a signal
> that was sent to you.
>
>> Bear with me please :-) There are big gaps in the documentation of this
>> stuff online, and if what Hotspot does is "correct" and "standard" I would
>> like to know that and see the docs indicating such.
>>
>
> I can't speak to the accuracy of the Mac OS documentation. I'm working
> from the POSIX specification and looking at how Linux actually behaves.
>

I'm honestly trying to look at every piece of doc I can find...MacOS,
Linux, POSIX. I guess what I'm starting to get is that the sequence of
events and resulting process state from handled signals is largely
unspecified and up to individual programs...

> Obviously there is some leeway here as to how a process ultimately chooses
> to facilitate termination. After catching a signal (any signal) a process
> could raise a different signal that is not caught and so have the OS
> terminate it - instead of calling exit(). That's what that little code
> snippet your report links to shows - it catches TERM, does what it needs to
> do, then removes the handler for TERM, raises TERM against itself and so
> commits process suicide. It gives the illusion that it was terminated by
> the initial TERM signal, but it wasn't, it was terminated by the second,
> uncaught, TERM signal.
>
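
The catch, clean up, reset, re-raise sequence described in the quoted text can be sketched as follows. This is a minimal illustration in Python, whose os module exposes the POSIX wait-status macros under the same names. It is not the snippet from the report and not Hotspot code, and the function name run_child_that_reraises is hypothetical.

```python
import os
import signal
import time


def run_child_that_reraises():
    """Fork a child that catches SIGTERM, then re-raises it uncaught.

    Returns the raw wait status that the parent collects for the child.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(ready_r)

            def on_term(signum, frame):
                # Application-specific cleanup would go here.
                # Restore the default disposition, then send TERM again.
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
                os.kill(os.getpid(), signal.SIGTERM)  # second, uncaught TERM

            signal.signal(signal.SIGTERM, on_term)
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                time.sleep(0.1)
        finally:
            os._exit(1)  # safety net; not reached when the re-raise works

    # parent process
    os.close(ready_w)
    os.read(ready_r, 1)  # wait until the child has installed its handler
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)  # first TERM, caught by the child
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    status = run_child_that_reraises()
    print("WIFSIGNALED:", os.WIFSIGNALED(status))
    print("WIFEXITED:", os.WIFEXITED(status))
```

The parent sees a child that was terminated by a signal, even though the TERM that did it was the second one, sent by the child to itself after its own cleanup.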

I think that's actually what I want, though, and I'm not entirely clear why
Hotspot can't do that.

If there's no handler, and you TERM an app, you get the results I've shown
above (the results I hope to get from TERMing Hotspot).

If there's a handler, then you have one of two options:

* Do whatever your handler needs to do and then re-propagate the signal to
the default handler to facilitate kernel-level shutdown and state
manipulation.
* Do whatever your handler needs to do and NOT re-propagate the signal to
the default handler. Encode your own app-specific exit code.

Why couldn't Hotspot:

* Save the fact that termination was initiated by a TERM signal.
* Do all appropriate shutdown for the JVM.
* Prepare to exit process, see that shutdown was initiated by TERM, and
re-propagate TERM to the system handler rather than doing the "clean"
exit().

Command-line exit code would be right, WIFSIGNALED would indicate that this
was signal-initiated shutdown of the process, and the remaining macros
would indicate why. Everyone's happy!

This also explains the difference from CRuby, which I'm trying to
(a.k.a. forced to) emulate. CRuby *does* repropagate the signal after doing
its VM cleanup, and so the macros reflect a shutdown initiated by an
external signal.

> As I said there is no standard that says what happens for caught signals,
> only what happens for uncaught signals - and uncaught signals are all that
> counts for waitpid and WI* macros on exit status. And you have to recognize
> that the macros are only valid under specific conditions - no point reading
> WEXITSTATUS if the process was WIFSIGNALED; no point reading WTERMSIG if
> the process was WIFEXITED.
>
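
The rule that each macro is only meaningful under its matching predicate can be sketched as a small decoder. This is an illustrative helper, not part of any library; the name describe_status and the wording of its return strings are hypothetical.

```python
import os
import signal


def describe_status(status):
    """Decode a raw wait status, reading each field only when it is valid."""
    if os.WIFEXITED(status):
        # WEXITSTATUS is only meaningful when WIFEXITED is true.
        return f"exited normally with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        # WTERMSIG is only meaningful when WIFSIGNALED is true.
        sig = os.WTERMSIG(status)
        return f"terminated by uncaught signal {signal.Signals(sig).name} ({sig})"
    if os.WIFSTOPPED(status):
        # Job-control stop (e.g. SIGSTOP); the process has not terminated.
        sig = os.WSTOPSIG(status)
        return f"stopped by signal {signal.Signals(sig).name} ({sig})"
    return "unrecognized status"


if __name__ == "__main__":
    pid = os.fork()
    if pid == 0:
        os._exit(3)  # child exits normally with code 3
    _, status = os.waitpid(pid, 0)
    print(describe_status(status))
```

Checking the predicate first avoids reading an exit code out of a status that actually encodes a terminating signal, or the reverse.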
> It is up to an individual program to decide both what communication
> mechanisms can lead to its termination (signals, poison-packets, GUI
> events) and what exit status it reports in response to them. For the JVM
> the exit status reflects how the shutdown request was originated, so for
> signals it uses the "standard" convention of "signal number" + 128. This is
> useful and necessary information for a sysadmin.
>
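
The "signal number" + 128 convention mentioned above can be illustrated with a child that catches TERM and then exits normally. This is a sketch of the convention only, under the assumption that the handler calls exit itself; it is not Hotspot's implementation, and the function name run_child_that_exits_128_plus_signal is hypothetical.

```python
import os
import signal
import time


def run_child_that_exits_128_plus_signal():
    """Fork a child that catches SIGTERM and exits with code 128 + signum.

    Returns the raw wait status that the parent collects for the child.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(ready_r)

            def on_term(signum, frame):
                # Application-specific cleanup would go here.
                os._exit(128 + signum)  # normal exit carrying the signal number

            signal.signal(signal.SIGTERM, on_term)
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                time.sleep(0.1)
        finally:
            os._exit(1)  # safety net; not reached when the handler runs

    # parent process
    os.close(ready_w)
    os.read(ready_r, 1)  # wait until the child has installed its handler
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    status = run_child_that_exits_128_plus_signal()
    print("WIFEXITED:", os.WIFEXITED(status))
    print("WEXITSTATUS:", os.WEXITSTATUS(status))
```

Here the parent sees a normal exit whose code carries the signal number, so WIFSIGNALED is false and WTERMSIG is not meaningful for this status.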

The ultimate answer may be "it's legal to do this, and Hotspot does it, so
Hotspot will continue to do it" but I'd really like to understand *why* the
signal doesn't get repropagated. It seems to me there's no reason not to,
since the command-line exit code would still be the same. It would just
make the other macros reflect the fact that the process shut down in
response to a signal (and yes, I know it's not an *unhandled* signal...but
why can't we make those macros work anyway?) Most examples I've seen (of
TERM-hooking shutdown) *do* repropagate, presumably for the reasons I've
raised in this thread.

Thanks for your patience!

- Charlie