RFR(XXS): 8224793: os::die() does not honor CreateCoredumpOnCrash option

Fri May 31 09:18:56 UTC 2019

Hi Dan,

A few responses below.

On 31/05/2019 1:31 am, Daniel D. Daugherty wrote:
> Hi David,
> 
> On 5/29/19 11:05 PM, David Holmes wrote:
>> Hi Dan,
>>
>> On 30/05/2019 3:50 am, Daniel D. Daugherty wrote:
>>> On 5/29/19 1:19 PM, Kim Barrett wrote:
>>>>> On May 29, 2019, at 10:27 AM, Daniel D. Daugherty 
>>>>> <daniel.daugherty at oracle.com> wrote:
>>>>> On 5/28/19 9:03 PM, Kim Barrett wrote:
>>>>>> And looking at this some more, I think I agree with David's 
>>>>>> comment in
>>>>>> the CR, that os::die *should* always dump core.  It gets called when
>>>>>> things have gone horribly wrong, and a core dump might be the only 
>>>>>> way
>>>>>> to understand what went wrong.
>>>>> I understand where you and David are coming from. […]
>>>>>
>>>>> To run a test with '-XX:-CreateCoredumpOnCrash' and then see a core
>>>>> file in the execution directory violates the principle of least
>>>>> astonishment. From the test author's POV, they did what they are
>>>>> supposed to do (passing the '-XX:-CreateCoredumpOnCrash' option)
>>>>> and the result is unexpected.
>>>> The position that David and I are taking is that the unexpected core
>>>> dump is not a problem, and is indeed a good thing, because the test
>>>> apparently ran off into the weeds.
>>>
>>> The test in JDK-8188872 didn't run off into the weeds. It is testing
>>> exactly what it is intending to test:
>>>
>>>    - If an error handling step times out, we get a message to tell us
>>>      that the step timed out. We make sure we get at least two of those.
>>>    - If error handling reaches the global timeout, then the VM should
>>>      have been aborted by the WatcherThread. That's the intentional
>>>      os::die() call...
>>>
>>> and I'm asserting that the test author knows what he/she is doing when
>>> they specify the '-XX:-CreateCoredumpOnCrash' option. Please explain
>>> why you think it is a good thing to ignore what the test writer asked
>>> for and dump core when '-XX:-CreateCoredumpOnCrash' option is specified?
>>
>> Because we should not be changing the behaviour of os::die() because 
>> of what one very specific test that intends to end up calling die() 
>> desires.
> 
> Hmmmm... I'm proposing to change the behavior of os::die() because
> we have an option that says this:
> 
> src/hotspot/share/runtime/globals.hpp:
> 
>>   product(bool, CreateCoredumpOnCrash, 
>> true,                                \
>>           "Create core/mini dump on VM fatal 
>> error")                        \
> 
> so when we turn that option off, I'm asserting that we should not
> create a core dump on a VM fatal error. I happened to run into the
> issue with my work on JDK-8188872 and its specific test. The
> messenger in this case is JDK-8188872. Don't shoot the messenger.
> 
> We have a general policy of last option wins when talking about
> command line options. The general principal means that if an option
> is specified on the command line it overrides any internal setting
> for that option (within the constraints policy, of course). It also
> means that if a command line option is specified more than once on
> the command line, then the last setting wins, but that's not really
> relevant to this discussion.
> 
> If I specify a command line option, I think it is reasonable to
> expect that command line option to do what it claims to do. In this
> particular instance, if I specify: '-XX:-CreateCoredumpOnCrash'
> then it is reasonable to expect the opposite of this:
> 
>    Create core/mini dump on VM fatal error
> 
> which I read as:
> 
>    Do not create core/mini dump on VM fatal error
> 
> os::die() is a fatal error mechanism right? It's our way of telling
> the VM to:
> 
>    // Die immediately, no exit hook, no abort hook, no cleanup.
> 
> Yes, that comment should be updated to include mentioning the
> possibility of a core/mini-dump file. Do you disagree that os::die()
> is a fatal error mechanism?

Yes I disagree that os::die() is our normal/regular "fatal error 
mechanism". That mechanism through VMError::report_and_die() ends up in 
os::abort, and the flag is used on that path. os::die() is lower level 
than that.

>> os::die() means we've badly gone off the rails so we shouldn't pay any 
>> attention to the CreateCoreDumpOnCrash.
> 
> Yes, you've said that more than once. So far you haven't said it with
> any additional information or reasoning to make it anything more than
> your opinion. Don't forget this was my opinion also when I first started
> investigating JDK-8188872 and don't forget that I also inject os::die()
> calls in places where I'm debugging the VM and I want it to die.

Take a look at where we call os:die():

static void UserHandler(int sig, void *siginfo, void *context) {
  ...
   // Ctrl-C is pressed during error reporting, likely because the error
   // handler fails to abort. Let VM die immediately.
   if (sig == SIGINT && VMError::is_error_reported()) {
     os::die();
   }

...

void WatcherThread::run() {
   assert(this == watcher_thread(), "just checking");
    ...

     if (VMError::is_error_reported()) {
       // A fatal error has happened, the error 
handler(VMError::report_and_die)
       // should abort JVM after creating an error log file. However in some
       // rare cases, the error handler itself might deadlock. Here 
periodically
       // check for error reporting timeouts, and if it happens, just 
proceed to
       // abort the VM.
       ...
          err.print_raw_cr("# [ timer expired, abort... ]");
           // skip atexit/vm_exit/vm_abort hooks
           os::die();

---

void VMError::report_and_die(...)
...
      if (recursive_error_count++ > 30) {
         out.print_raw_cr("[Too many errors, abort]");
         os::die();
       }

...
     // if os::abort() doesn't abort, try os::die();
     os::die();

---

I'd say they are all cases of "going of the rails". The normal fatal 
error process hasn't worked as expected so we use os::die() to try and 
force an immediate termination.

> Let me put it another way:
> 
> The command line option -XX:-CreateCoreDumpOnCrash has been specified
> which is a very compelling indication of what the user/test writer wants.
> Why should I (the JVM engineer) decide that I know better than what the
> user/test writer specified?

The flag was not intended to apply to os::die() - yes I only have 
circumstantial historical evidence to back that up, not explicit 
documentation. The flag was added to allow tests that intentionally 
crash the VM to be run in a way that we don't normally get corefiles - 
as they can take a long time to produce, resulting in timeouts and 
excessive disk usage.

And this works perfectly fine for tests that trigger the "normal" fatal 
error handling response.

> 
>> If you change os::die() for this one test you lose the ability to 
>> diagnose when things really did go off the rails.
> 
> That's a strawman argument. If I'm analyzing a test failure for a test
> that has specified -XX:-CreateCoreDumpOnCrash, I can easily reenable
> core dumps when I'm trying to repro the issue.

_If_ you can repro it. The fact we went off the rails and called 
os::die() normally means something very unexpected happened. Your 
chances of repro'ing it are very low IMHO. The core file may be the only 
thing you have to work on.

> Also, a test that uses the -XX:-CreateCoreDumpOnCrash option is typically
> written to look at stdout, stderr or hs_err_pid for very specific crash
> information. If the test doesn't crash the way I expect (because things
> went off the rails), my first move would be to disable 
> CreateCoreDumpOnCrash,
> get a core file and see where it really crashed this time.

Again you're assuming you can reproduce something that is not likely to 
occur very often.

> 
>> CreateCoredumpOnCrash (previously known as CreateMinidumpOnCrash and 
>> used only on windows) is only used for the callers of os::abort to 
>> decide whether to set "dump_core" to true or false.
> 
> I'm not sure what point you're trying to make here. I would expect
> the flag to be used in calls to os::abort() in order to implement
> the policy specified by the option.

See above.

> I found a total of 32 tests that specify '-XX:-CreateCoredumpOnCrash'
> so it is not just one test or test writer that is saying: "I don't care
> about core files." We started adding this option to tests in order to
> reduce test results disk footprint in those cases where the core file
> was not considered important.

And most of those tests are only concerned with the first-level "fatal 
error handler" and everything is fine. But one test (perhaps two?) tries 
to deal with secondary failures during "fatal error handling" and we 
even had to introduce a special flag to assist with that. Those tests 
can more readily end up on a path that calls os::die(). For those 
specific tests we can augment os::die() to recognize when we're doing 
that special testing and so change the normal behaviour.

> 
>>
>>>
>>>>>> Maybe there are paths to os::die that ought to be calling os::abort
>>>>>> instead?
>>>>> Maybe, but not really the problem that I'm trying to solve today. :-)
>>>> That presumes that the problem is the behavior of os::die rather than
>>>> who is calling it.  David and I are suggesting the behavior of os::die
>>>> is fine, and the reason for the unexpected core dump from the test is
>>>> that os::die is being called when (perhaps) it shouldn't be. Or maybe
>>>> the test really is going off into the weeds and a call to os::die is
>>>> appropriate and the core dump might contain useful information.
>>>
>>> I believe in this case, that the author of the step timeout code,
>>> Thomas Stüfe, is calling os::die() intentionally. As noted above, the
>>> test is not off in the weeds.
>>>
>>> You edited out part of my reply where I said:
>>>
>>>> 1) os::die() doesn't say _anything_ about generating a core file:
>>>>
>>>>    src/hotspot/share/runtime/os.hpp:
>>>>
>>>>      // Die immediately, no exit hook, no abort hook, no cleanup.
>>>>      static void die(); 
>>>
>>> and you didn't address my assertion that there is no contract for
>>> os::die() to produce a core file. Remember that I'm also a user of
>>> os::die() when I want a core file... So it surprised me that there's
>>> no contract...
>>
>> Documentation is not exactly a strong point in hotspot code - lack of 
>> it is hardly a basis for inferring that something need not happen.
> 
> Similarly, you (and I originally) were making this inference without a
> basis in documentation:
> 
>  > os::die() means we've badly gone off the rails so we shouldn't pay 
> any attention to the CreateCoreDumpOnCrash.
> 
> 
>> os::die() calls ::abort and we know ::abort does a core dump.
> 
> Yup. Chasing down that 'proof' ended up being an adventure in
> man page reading on the different platforms that I have in my
> lab...
> 
> 
>> Solaris even spells that out:
>>
>> ::abort(); // dump core (for debugging)
> 
> Yes, Solaris does and I saw that because the JDK-8188872 failure that
> I'm chasing has only been seen on Solaris. I've thought about whether
> to add the comment to the other os::die() instances, but we might be
> moving 'os::die()' to os_posix.cpp and that would solve the issue (if
> we retain the comment).
> 
> 
>> So this lack of "contract" is easily fixed.
> 
> We should fix that in src/hotspot/runtime/os.hpp.
> 
> 
>> The point is that this last ditch attempt to kill the process should 
>> not be looking at flags like CreateCoredumpOnCrash.
> 
> Again, I'm going to return to the fact that the user/test-writer
> specified '-XX:-CreateCoredumpOnCrash' option on the command line and
> we (VM engineers) are deciding to ignore that. To me that is not
> consistent with our general cmd line options policy.
> 
> Look at this way. We've found an inconsistency in cmd line option policy.
> We could "fix" it by this change:
> 
>    product(bool, CreateCoredumpOnCrash, true,
>            "Create core/mini dump on VM fatal error. "
>            "os::die() ignores this option and always creates a core/mini 
> dump.")
> 
> That would make it clear to users and test-writers that using
> -XX:-CreateCoredumpOnCrash might work sometimes. Is that what
> we want?

os::die() operates at a lower-level to  the regular fatal error handling 
this flag was intended to interact with.

> 
>> Nor should its behaviour be dictated by a test!
> 
> That's a really interesting statement. And I do hope I'm
> reading the underlying attitude wrong.
> 
> I see this as a cmd line option consistency issue so the behaviour
> is dictated by the option and the user or test-writer that specifies
> the option.

I don't see it as that at all. I see a very low-level last-resort 
termination function that isn't part of regular abnormal termination 
mechanism and which therefore isn't expected to interact with flags that 
affect that mechanism.

Cheers,
David
-----

> 
> This has been a really interesting discussion. I was
> originally going to keep the fix for this bug bundled up
> with my fix for JDK-8188872. But it occurred to me that
> this change really needed to be tracked on its own so
> that's why I split it off.
> 
> Thanks for continuing the discussion.
> 
> Dan
> 
> 
>>
>> Thanks,
>> David
>> -----
>>
>>>>>> I think os::abort probably should not be calling ::exit either, but
>>>>>> should be raising SIGKILL if no core dump is requested. os::exit is
>>>>>> the path to exit functions and the like.
>>>>> Agreed (or maybe not). If I'm in os::abort() and I call '::exit()',
>>>>> am I calling 'os::exit()' or am I calling libc's exit()? I thought
>>>>> I was calling libc's exit(), but I could be wrong…
>>>> ::exit() is libc's exit(), which calls atexit handlers.
>>>>
>>>> Also, ::_exit() or _Exit() (the latter is in <stdlib.h>) might be
>>>> alternatives to using SIGKILL.
>>>
>>> I'm testing SIGKILL right now... :-)
>>>
>>>
>>>>
>>>>>> It's unfortunate that the Linux version of os::abort needs to do
>>>>>> something special for DumpPrivateMappingsInCore; without that, we
>>>>>> could have a merged posix version of os::abort.
>>>>> I think that's a fairly recent addition to the Linux version of
>>>>> os::abort(). It could probably be handled by a posix version of
>>>>> os::abort() that calls a PD version of os::abort() as needed or
>>>>> something like that.
>>>>>
>>>>> Again, not the problem that I'm trying to solve today.
>>>> Agreed.
>>>>
>>>
>>> Thanks for continuing the discussion...
>>>
>>> Dan
>>>
>