Misbehaving exit status from Hotspot

Thu Jun 28 22:34:24 UTC 2018

On 29/06/2018 6:30 AM, Charles Oliver Nutter wrote:
> 
> On Thu, Jun 28, 2018 at 12:24 AM, David Holmes
> <david.holmes at oracle.com> wrote:
> 
>     On 28/06/2018 1:33 AM, Charles Oliver Nutter wrote:
> 
>         Oops, in my editing of the post I lost the link to sources.
>         Perhaps this will illustrate the problem I'm talking about a bit
>         better!
> 
>         https://gist.github.com/headius/b87bc50b488fd73e753cbc518550ae5f
> 
> 
>     Okay so the test outputs:
> 
>     $ ./sigtest `which java` Loop
>     pid: 22136
>     status: 36608
>     exited: 1, stop signal: 143, term signal: 0, exit status: 143
> 
>     WIFEXITED is defined as:
> 
>     "Evaluates to a non-zero value if status was returned for a child
>     process that terminated normally."
> 
>     So this value is one because the JVM process did exit "normally" -
>     it called exit(143);
> 
> 
> Why does it terminate "abnormally" if you use -XX:+ReduceSignalUsage then? 
> Does it no longer handle TERM at all?

Correct. With -Xrs there is no handler for TERM and so the default 
behaviour is in force and the process is terminated due to an uncaught 
signal.
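
To make the uncaught case concrete, here is a minimal sketch in C. It assumes a POSIX system; the function name status_after_uncaught_term and the one-byte pipe handshake are illustrative choices, not taken from the linked gist. The child installs no handler for TERM, so the default action applies when the parent signals it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that leaves SIGTERM at its default
 * disposition, send it SIGTERM, and return the raw wait status
 * (-1 on error). */
int status_after_uncaught_term(void)
{
    int ready[2];
    if (pipe(ready) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (pid == 0) {
        close(ready[0]);
        signal(SIGTERM, SIG_DFL);          /* no handler for TERM */
        if (write(ready[1], "x", 1) != 1)  /* tell the parent we are set up */
            _exit(1);
        for (;;)
            pause();                       /* sleep until a signal arrives */
    }

    close(ready[1]);
    char byte;
    if (read(ready[0], &byte, 1) == 1)     /* wait for the child's byte */
        kill(pid, SIGTERM);
    close(ready[0]);

    int status = -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

For the returned status, WIFSIGNALED should be non-zero, WTERMSIG should be SIGTERM, and WIFEXITED should be zero.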

>     WSTOPSIG is defined as:
> 
>     "If the value of WIFSTOPPED(stat_val) is non-zero, this macro
>     evaluates to the number of the signal that caused the child process
>     to stop."
> 
>     But we haven't checked WIFSTOPPED (and the process is terminated not
>     stopped) so this is "garbage".
> 
> 
> But a TERM signal *did* cause the process to stop, didn't it? This is 
> not an appropriate value for the stop signal in that case (indeed, garbage).

"stopping" a process is not "terminating" a process. ref: SIGSTOP, 
SIGCONT ("job control").
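
The difference can be observed directly. The following is a sketch only, assuming a POSIX system; status_after_sigstop is a made-up name. The parent stops a child with SIGSTOP and asks waitpid to report stopped children by passing WUNTRACED, then kills and reaps the child so nothing is left behind.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: stop a child with SIGSTOP and return the wait
 * status reported for the stop (-1 on error). */
int status_after_sigstop(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                 /* idle until signalled */
    }

    int status = -1;
    kill(pid, SIGSTOP);              /* SIGSTOP cannot be caught or ignored */
    pid_t seen = waitpid(pid, &status, WUNTRACED);  /* report stops too */

    kill(pid, SIGKILL);              /* clean up: terminate the stopped child */
    waitpid(pid, NULL, 0);           /* and reap it */
    return seen == pid ? status : -1;
}
```

Here WIFSTOPPED is non-zero and WSTOPSIG is SIGSTOP, while WIFEXITED and WIFSIGNALED are both zero: the child was stopped, not terminated, at the time of that report.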

>     WTERMSIG is defined as:
> 
>     "If the value of WIFSIGNALED(stat_val) is non-zero, this macro
>     evaluates to the number of the signal that caused the termination of
>     the child process."
> 
> 
>     You haven't checked WIFSIGNALED but it will be zero as the process
>     was not terminated by an _uncaught signal_. So the value zero here
>     is fine, but could be anything given WIFSIGNALED will be zero.
> 
> 
> I guess this is more of the same...if you handle a TERM signal, but then 
> ultimately do terminate...wasn't the termination in response to the TERM 
> signal? If I did not send TERM, it would not have shut down.

The process did not terminate due to an _uncaught signal_. The process 
chose to terminate in response to the signal that was sent and caught. 
Signals are just a communication mechanism made available to the 
process. The TERM signal requests a process to terminate. If the process 
has not installed a handler for TERM then the OS will terminate it. If 
the process has installed a handler for TERM then the handler is invoked 
and the process is free to do what it likes - it can choose to terminate 
immediately, make a note to terminate when convenient, or choose to 
ignore it completely, or whatever. That's why we also have KILL and 
ABRT, because TERM is cooperative.
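
As an illustration of the "terminate immediately" choice, here is a hedged sketch in C, assuming a POSIX system. The names exit_on_term and status_after_caught_term are invented for the example, and the handler exits straight away with 128 + signal number; a real program would more likely set a flag and shut down from its main flow.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The process chooses to exit when TERM arrives. _exit() is
 * async-signal-safe, so it may be called from a handler. */
static void exit_on_term(int sig)
{
    _exit(128 + sig);
}

/* Hypothetical helper: fork a child that catches SIGTERM, send it
 * SIGTERM, and return the raw wait status (-1 on error). */
int status_after_caught_term(void)
{
    int ready[2];
    if (pipe(ready) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        sa.sa_handler = exit_on_term;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        close(ready[0]);
        if (sigaction(SIGTERM, &sa, NULL) != 0)
            _exit(1);
        if (write(ready[1], "x", 1) != 1)  /* handler is now installed */
            _exit(1);
        for (;;)
            pause();
    }

    close(ready[1]);
    char byte;
    if (read(ready[0], &byte, 1) == 1)     /* wait until the handler exists */
        kill(pid, SIGTERM);
    close(ready[0]);

    int status = -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

The parent then sees WIFEXITED non-zero and WEXITSTATUS equal to 128 + SIGTERM, with WIFSIGNALED zero, because the child ended by calling _exit rather than being terminated by the OS.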

> My docs, on MacOS, say something similar: "True if the process 
> terminated due to receipt of a signal."

This OS X doc:

https://www.unix.com/man-page/osx/2/waitpid/

also states before that:

WIFEXITED(status)
   True if the process terminated normally by a call to _exit(2) or exit(3).

The JVM did call exit(3) hence WIFEXITED is true. WIFEXITED and 
WIFSIGNALED are mutually exclusive. If you take the two definitions 
together then a process that calls exit won't be WIFSIGNALED.

> The process *did* terminate due to receipt of a signal. The number of 
> the signal that caused the termination of the process was 15, TERM. I'm 
> still not getting something in your logic, I guess?

Choosing to terminate in response to a signal you caught is not the 
same thing as being terminated (by the OS) because you did not catch a 
signal that was sent to you.

> Bear with me please :-) There are big gaps in the documentation of this 
> stuff online, and if what Hotspot does is "correct" and "standard" I 
> would like to know that and see the docs indicating such.

I can't speak to the accuracy of the Mac OS documentation. I'm working 
from the POSIX specification and looking at how Linux actually behaves.

Obviously there is some leeway here as to how a process ultimately 
chooses to facilitate termination. After catching a signal (any signal) 
a process could raise a different signal that is not caught and so have 
the OS terminate it - instead of calling exit(). That's what that little 
code snippet your report links to shows - it catches TERM, does what it 
needs to do, then removes the handler for TERM, raises TERM against 
itself and so commits process suicide. It gives the illusion that it was 
terminated by the initial TERM signal, but it wasn't, it was terminated 
by the second, uncaught, TERM signal.
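
The catch-then-re-raise pattern described above can be sketched as follows. This is my own minimal reconstruction under POSIX assumptions, not the code from the linked report; cleanup_then_reraise and status_after_reraise are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Catch the signal, then restore the default action and send the same
 * signal to ourselves. With sa_flags == 0 the signal is blocked while
 * the handler runs, so the second TERM is delivered (and terminates
 * the process) once the handler returns. */
static void cleanup_then_reraise(int sig)
{
    /* application cleanup would go here */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Hypothetical helper: returns the raw wait status, or -1 on error. */
int status_after_reraise(void)
{
    int ready[2];
    if (pipe(ready) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        sa.sa_handler = cleanup_then_reraise;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        close(ready[0]);
        if (sigaction(SIGTERM, &sa, NULL) != 0)
            _exit(1);
        if (write(ready[1], "x", 1) != 1)  /* handler is now installed */
            _exit(1);
        for (;;)
            pause();
    }

    close(ready[1]);
    char byte;
    if (read(ready[0], &byte, 1) == 1)
        kill(pid, SIGTERM);                /* the first, caught TERM */
    close(ready[0]);

    int status = -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

This time the parent sees WIFSIGNALED non-zero and WTERMSIG equal to SIGTERM, exactly as for a process that never installed a handler.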

>     WEXITSTATUS is defined as:
> 
>     "If the value of WIFEXITED(stat_val) is non-zero, this macro
>     evaluates to the low-order 8 bits of the status argument that the
>     child process passed to _exit() or exit(), or the value the child
>     process returned from main()."
> 
>     The JVM called exit(143) so we expect to get 143 and that's exactly
>     what we do get.
> 
> 
> Here's more confusion for me: WEXITSTATUS is different from what $? 
> would be at a command line, correct? Because for the C program, 
> WEXITSTATUS is 0 and the exit code at command line is 143.

WEXITSTATUS is only valid if WIFEXITED(stat_val) is non-zero. In the C 
program you were terminated by an uncaught signal so WIFEXITED is zero, 
so WEXITSTATUS is "garbage".
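
In code, that means testing the predicate before reading the value it guards. A small sketch, with the type and function names (child_end, decode_wait_status, status_of_child_exiting_with) invented for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum end_kind { END_EXITED, END_SIGNALED, END_STOPPED, END_UNKNOWN };

struct child_end {
    enum end_kind kind;
    int value;  /* exit status or signal number, depending on kind */
};

/* Read each accessor macro only after its predicate has matched. */
struct child_end decode_wait_status(int status)
{
    struct child_end end = { END_UNKNOWN, 0 };

    if (WIFEXITED(status)) {
        end.kind = END_EXITED;
        end.value = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        end.kind = END_SIGNALED;
        end.value = WTERMSIG(status);
    } else if (WIFSTOPPED(status)) {
        end.kind = END_STOPPED;
        end.value = WSTOPSIG(status);
    }
    return end;
}

/* Hypothetical helper for trying it out: run a child that calls
 * _exit(code) and return its raw wait status (-1 on error). */
int status_of_child_exiting_with(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    int status = -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

Decoding the status of a child that exits with 143 gives kind END_EXITED and value 143. On Linux, where the exit status occupies bits 8 to 15 of the raw value, that corresponds to the 36608 (143 * 256) printed by the test.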

> I am not arguing that the exit(143) is really *wrong*...it just doesn't 
> really seem to match the rest of the state.
> 
> Here's my logic:
> 
> Command-line exit result of 143 is "ok"...standard says that it should 
> be 128+N for that value when a signal N caused the process to end.
> 
> However, none of the OTHER state that would indicate a signal-based 
> termination is set properly. There's no termsig and stopsig is 
> nonsense. So we get a non-zero exit code indicating that a signal caused 
> the process to end, and yet none of the W macros produce the right 
> values to give us more information. So, did it terminate because of a 
> signal or not? One result says yes, the other result says no.
> 
> If this is standard, I would like to see that standard. This is where 
> our bug reports come from...these results do not match any other 
> programs our users are managing via signals+waitpid.
> 
>     The flaw with your thinking here is that sending a signal to tell
>     the VM to terminate should behave as-if the VM received (and
>     terminated due to) an uncaught signal. It doesn't - nor should it.
> 
> 
> Is that really the flaw in my thinking? I don't care if these are caught 
> or uncaught signals...should I? Is there a spec that says *caught* 
> signals used for a clean shutdown should now indicate the process exited 
> normally? I don't get it.

There is no spec that says how a process needs to behave after catching 
a signal.

> It would not have stopped if I had not sent the signal, and so I believe 
> the macros above should indicate that. If it was a normal termination, 
> I'd expect the exit status to be zero. But exit status indicates 
> shutdown was *not* normal...it was due to a signal...but then the signal 
> macros for waitpid don't also reflect that.
> 
> Again, if there's a standard documenting that this is typically how 
> TERM-handling C programs are supposed to work, I would be happy to be 
> reeducated. But at the moment, the numbers aren't lining up for me.

As I said there is no standard that says what happens for caught 
signals, only what happens for uncaught signals - and uncaught signals 
are all that counts for waitpid and the W* macros on exit status. And you 
have to recognize that the macros are only valid under specific 
conditions - no point reading WEXITSTATUS if the process was 
WIFSIGNALED; no point reading WTERMSIG if the process was WIFEXITED.

It is up to an individual program to decide both what communication 
mechanisms can lead to its termination (signals, poison-packets, GUI 
events) and what exit status it reports in response to them. For the JVM 
the exit status reflects how the shutdown request was originated, so for 
signals it uses the "standard" convention of "signal number" + 128. This 
is useful and necessary information for a sysadmin.
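
This also explains why $? shows 143 in both cases. A shell such as bash collapses the wait status into a single number, reporting 128 + N when the child was terminated by signal N, so a process that calls exit(143) and one killed by an uncaught TERM look the same at the prompt. A rough sketch of that mapping, with shell_style_code and child_status as made-up names:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Collapse a wait status into one number the way bash reports $?.
 * Returns -1 for statuses that are neither an exit nor a signal death. */
int shell_style_code(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Hypothetical helper: if sig is non-zero the child dies from that
 * signal with the default action; otherwise it calls _exit(code).
 * Returns the raw wait status (-1 on error). */
int child_status(int sig, int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig != 0) {
            signal(sig, SIG_DFL);
            raise(sig);          /* terminated here by the default action */
        }
        _exit(code);
    }

    int status = -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

Both shell_style_code(child_status(0, 143)) and shell_style_code(child_status(SIGTERM, 0)) evaluate to 143 where SIGTERM is 15, even though only the second status has WIFSIGNALED set. The distinction is visible through waitpid but lost in $?.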

David

> - Charlie