RFR(XXS): 8224793: os::die() does not honor CreateCoredumpOnCrash option

Thu May 30 15:31:32 UTC 2019

Hi David,

On 5/29/19 11:05 PM, David Holmes wrote:
> Hi Dan,
>
> On 30/05/2019 3:50 am, Daniel D. Daugherty wrote:
>> On 5/29/19 1:19 PM, Kim Barrett wrote:
>>>> On May 29, 2019, at 10:27 AM, Daniel D. Daugherty 
>>>> <daniel.daugherty at oracle.com> wrote:
>>>> On 5/28/19 9:03 PM, Kim Barrett wrote:
>>>>> And looking at this some more, I think I agree with David's 
>>>>> comment in
>>>>> the CR, that os::die *should* always dump core.  It gets called when
>>>>> things have gone horribly wrong, and a core dump might be the only 
>>>>> way
>>>>> to understand what went wrong.
>>>> I understand where you and David are coming from. […]
>>>>
>>>> To run a test with '-XX:-CreateCoredumpOnCrash' and then see a core
>>>> file in the execution directory violates the principle of least
>>>> astonishment. From the test author's POV, they did what they are
>>>> supposed to do (passing the '-XX:-CreateCoredumpOnCrash' option)
>>>> and the result is unexpected.
>>> The position that David and I are taking is that the unexpected core
>>> dump is not a problem, and is indeed a good thing, because the test
>>> apparently ran off into the weeds.
>>
>> The test in JDK-8188872 didn't run off into the weeds. It is testing
>> exactly what it is intending to test:
>>
>>    - If an error handling step times out, we get a message to tell us
>>      that the step timed out. We make sure we get at least two of those.
>>    - If error handling reaches the global timeout, then the VM should
>>      have been aborted by the WatcherThread. That's the intentional
>>      os::die() call...
>>
>> and I'm asserting that the test author knows what he/she is doing when
>> they specify the '-XX:-CreateCoredumpOnCrash' option. Please explain
>> why you think it is a good thing to ignore what the test writer asked
>> for and dump core when '-XX:-CreateCoredumpOnCrash' option is specified?
>
> Because we should not be changing the behaviour of os::die() because 
> of what one very specific test that intends to end up calling die() 
> desires.

Hmmmm... I'm proposing to change the behavior of os::die() because
we have an option that says this:

src/hotspot/share/runtime/globals.hpp:

>   product(bool, CreateCoredumpOnCrash, 
> true,                                \
>           "Create core/mini dump on VM fatal 
> error")                        \

so when we turn that option off, I'm asserting that we should not
create a core dump on a VM fatal error. I happened to run into the
issue with my work on JDK-8188872 and its specific test. The
messenger in this case is JDK-8188872. Don't shoot the messenger.

We have a general policy of last option wins when talking about
command line options. The general principal means that if an option
is specified on the command line it overrides any internal setting
for that option (within the constraints policy, of course). It also
means that if a command line option is specified more than once on
the command line, then the last setting wins, but that's not really
relevant to this discussion.

If I specify a command line option, I think it is reasonable to
expect that command line option to do what it claims to do. In this
particular instance, if I specify: '-XX:-CreateCoredumpOnCrash'
then it is reasonable to expect the opposite of this:

   Create core/mini dump on VM fatal error

which I read as:

   Do not create core/mini dump on VM fatal error

os::die() is a fatal error mechanism right? It's our way of telling
the VM to:

   // Die immediately, no exit hook, no abort hook, no cleanup.

Yes, that comment should be updated to include mentioning the
possibility of a core/mini-dump file. Do you disagree that os::die()
is a fatal error mechanism?

> os::die() means we've badly gone off the rails so we shouldn't pay any 
> attention to the CreateCoreDumpOnCrash.

Yes, you've said that more than once. So far you haven't said it with
any additional information or reasoning to make it anything more than
your opinion. Don't forget this was my opinion also when I first started
investigating JDK-8188872 and don't forget that I also inject os::die()
calls in places where I'm debugging the VM and I want it to die.

Let me put it another way:

The command line option -XX:-CreateCoreDumpOnCrash has been specified
which is a very compelling indication of what the user/test writer wants.
Why should I (the JVM engineer) decide that I know better than what the
user/test writer specified?

> If you change os::die() for this one test you lose the ability to 
> diagnose when things really did go off the rails.

That's a strawman argument. If I'm analyzing a test failure for a test
that has specified -XX:-CreateCoreDumpOnCrash, I can easily reenable
core dumps when I'm trying to repro the issue.

Also, a test that uses the -XX:-CreateCoreDumpOnCrash option is typically
written to look at stdout, stderr or hs_err_pid for very specific crash
information. If the test doesn't crash the way I expect (because things
went off the rails), my first move would be to disable 
CreateCoreDumpOnCrash,
get a core file and see where it really crashed this time.

> CreateCoredumpOnCrash (previously known as CreateMinidumpOnCrash and 
> used only on windows) is only used for the callers of os::abort to 
> decide whether to set "dump_core" to true or false.

I'm not sure what point you're trying to make here. I would expect
the flag to be used in calls to os::abort() in order to implement
the policy specified by the option.

I found a total of 32 tests that specify '-XX:-CreateCoredumpOnCrash'
so it is not just one test or test writer that is saying: "I don't care
about core files." We started adding this option to tests in order to
reduce test results disk footprint in those cases where the core file
was not considered important.

>
>>
>>>>> Maybe there are paths to os::die that ought to be calling os::abort
>>>>> instead?
>>>> Maybe, but not really the problem that I'm trying to solve today. :-)
>>> That presumes that the problem is the behavior of os::die rather than
>>> who is calling it.  David and I are suggesting the behavior of os::die
>>> is fine, and the reason for the unexpected core dump from the test is
>>> that os::die is being called when (perhaps) it shouldn't be. Or maybe
>>> the test really is going off into the weeds and a call to os::die is
>>> appropriate and the core dump might contain useful information.
>>
>> I believe in this case, that the author of the step timeout code,
>> Thomas Stüfe, is calling os::die() intentionally. As noted above, the
>> test is not off in the weeds.
>>
>> You edited out part of my reply where I said:
>>
>>> 1) os::die() doesn't say _anything_ about generating a core file:
>>>
>>>    src/hotspot/share/runtime/os.hpp:
>>>
>>>      // Die immediately, no exit hook, no abort hook, no cleanup.
>>>      static void die(); 
>>
>> and you didn't address my assertion that there is no contract for
>> os::die() to produce a core file. Remember that I'm also a user of
>> os::die() when I want a core file... So it surprised me that there's
>> no contract...
>
> Documentation is not exactly a strong point in hotspot code - lack of 
> it is hardly a basis for inferring that something need not happen.

Similarly, you (and I originally) were making this inference without a
basis in documentation:

 > os::die() means we've badly gone off the rails so we shouldn't pay 
any attention to the CreateCoreDumpOnCrash.

> os::die() calls ::abort and we know ::abort does a core dump.

Yup. Chasing down that 'proof' ended up being an adventure in
man page reading on the different platforms that I have in my
lab...

> Solaris even spells that out:
>
> ::abort(); // dump core (for debugging)

Yes, Solaris does and I saw that because the JDK-8188872 failure that
I'm chasing has only been seen on Solaris. I've thought about whether
to add the comment to the other os::die() instances, but we might be
moving 'os::die()' to os_posix.cpp and that would solve the issue (if
we retain the comment).

> So this lack of "contract" is easily fixed.

We should fix that in src/hotspot/runtime/os.hpp.

> The point is that this last ditch attempt to kill the process should 
> not be looking at flags like CreateCoredumpOnCrash.

Again, I'm going to return to the fact that the user/test-writer
specified '-XX:-CreateCoredumpOnCrash' option on the command line and
we (VM engineers) are deciding to ignore that. To me that is not
consistent with our general cmd line options policy.

Look at this way. We've found an inconsistency in cmd line option policy.
We could "fix" it by this change:

   product(bool, CreateCoredumpOnCrash, true,
           "Create core/mini dump on VM fatal error. "
           "os::die() ignores this option and always creates a core/mini 
dump.")

That would make it clear to users and test-writers that using
-XX:-CreateCoredumpOnCrash might work sometimes. Is that what
we want?

> Nor should its behaviour be dictated by a test!

That's a really interesting statement. And I do hope I'm
reading the underlying attitude wrong.

I see this as a cmd line option consistency issue so the behaviour
is dictated by the option and the user or test-writer that specifies
the option.

This has been a really interesting discussion. I was
originally going to keep the fix for this bug bundled up
with my fix for JDK-8188872. But it occurred to me that
this change really needed to be tracked on its own so
that's why I split it off.

Thanks for continuing the discussion.

Dan

>
> Thanks,
> David
> -----
>
>>>>> I think os::abort probably should not be calling ::exit either, but
>>>>> should be raising SIGKILL if no core dump is requested. os::exit is
>>>>> the path to exit functions and the like.
>>>> Agreed (or maybe not). If I'm in os::abort() and I call '::exit()',
>>>> am I calling 'os::exit()' or am I calling libc's exit()? I thought
>>>> I was calling libc's exit(), but I could be wrong…
>>> ::exit() is libc's exit(), which calls atexit handlers.
>>>
>>> Also, ::_exit() or _Exit() (the latter is in <stdlib.h>) might be
>>> alternatives to using SIGKILL.
>>
>> I'm testing SIGKILL right now... :-)
>>
>>
>>>
>>>>> It's unfortunate that the Linux version of os::abort needs to do
>>>>> something special for DumpPrivateMappingsInCore; without that, we
>>>>> could have a merged posix version of os::abort.
>>>> I think that's a fairly recent addition to the Linux version of
>>>> os::abort(). It could probably be handled by a posix version of
>>>> os::abort() that calls a PD version of os::abort() as needed or
>>>> something like that.
>>>>
>>>> Again, not the problem that I'm trying to solve today.
>>> Agreed.
>>>
>>
>> Thanks for continuing the discussion...
>>
>> Dan
>>