RFR(XXS): 8224793: os::die() does not honor CreateCoredumpOnCrash option

Fri May 31 15:13:29 UTC 2019

Hi David (and Kim),

Please start with my reply to Kim's first message from last night...

Only some scattered comments below...

On 5/31/19 5:18 AM, David Holmes wrote:
> Hi Dan,
>
> A few responses below.
>
> On 31/05/2019 1:31 am, Daniel D. Daugherty wrote:
>> Hi David,
>>
>> On 5/29/19 11:05 PM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> On 30/05/2019 3:50 am, Daniel D. Daugherty wrote:
>>>> On 5/29/19 1:19 PM, Kim Barrett wrote:
>>>>>> On May 29, 2019, at 10:27 AM, Daniel D. Daugherty 
>>>>>> <daniel.daugherty at oracle.com> wrote:
>>>>>> On 5/28/19 9:03 PM, Kim Barrett wrote:
>>>>>>> And looking at this some more, I think I agree with David's 
>>>>>>> comment in
>>>>>>> the CR, that os::die *should* always dump core.  It gets called 
>>>>>>> when
>>>>>>> things have gone horribly wrong, and a core dump might be the 
>>>>>>> only way
>>>>>>> to understand what went wrong.
>>>>>> I understand where you and David are coming from. […]
>>>>>>
>>>>>> To run a test with '-XX:-CreateCoredumpOnCrash' and then see a core
>>>>>> file in the execution directory violates the principle of least
>>>>>> astonishment. From the test author's POV, they did what they are
>>>>>> supposed to do (passing the '-XX:-CreateCoredumpOnCrash' option)
>>>>>> and the result is unexpected.
>>>>> The position that David and I are taking is that the unexpected core
>>>>> dump is not a problem, and is indeed a good thing, because the test
>>>>> apparently ran off into the weeds.
>>>>
>>>> The test in JDK-8188872 didn't run off into the weeds. It is testing
>>>> exactly what it is intending to test:
>>>>
>>>>    - If an error handling step times out, we get a message to tell us
>>>>      that the step timed out. We make sure we get at least two of 
>>>> those.
>>>>    - If error handling reaches the global timeout, then the VM should
>>>>      have been aborted by the WatcherThread. That's the intentional
>>>>      os::die() call...
>>>>
>>>> and I'm asserting that the test author knows what he/she is doing when
>>>> they specify the '-XX:-CreateCoredumpOnCrash' option. Please explain
>>>> why you think it is a good thing to ignore what the test writer asked
>>>> for and dump core when '-XX:-CreateCoredumpOnCrash' option is 
>>>> specified?
>>>
>>> Because we should not be changing the behaviour of os::die() because 
>>> of what one very specific test that intends to end up calling die() 
>>> desires.
>>
>> Hmmmm... I'm proposing to change the behavior of os::die() because
>> we have an option that says this:
>>
>> src/hotspot/share/runtime/globals.hpp:
>>
>>>   product(bool, CreateCoredumpOnCrash, 
>>> true,                                \
>>>           "Create core/mini dump on VM fatal 
>>> error")                        \
>>
>> so when we turn that option off, I'm asserting that we should not
>> create a core dump on a VM fatal error. I happened to run into the
>> issue with my work on JDK-8188872 and its specific test. The
>> messenger in this case is JDK-8188872. Don't shoot the messenger.
>>
>> We have a general policy of last option wins when talking about
>> command line options. The general principal means that if an option
>> is specified on the command line it overrides any internal setting
>> for that option (within the constraints policy, of course). It also
>> means that if a command line option is specified more than once on
>> the command line, then the last setting wins, but that's not really
>> relevant to this discussion.
>>
>> If I specify a command line option, I think it is reasonable to
>> expect that command line option to do what it claims to do. In this
>> particular instance, if I specify: '-XX:-CreateCoredumpOnCrash'
>> then it is reasonable to expect the opposite of this:
>>
>>    Create core/mini dump on VM fatal error
>>
>> which I read as:
>>
>>    Do not create core/mini dump on VM fatal error
>>
>> os::die() is a fatal error mechanism right? It's our way of telling
>> the VM to:
>>
>>    // Die immediately, no exit hook, no abort hook, no cleanup.
>>
>> Yes, that comment should be updated to include mentioning the
>> possibility of a core/mini-dump file. Do you disagree that os::die()
>> is a fatal error mechanism?
>
> Yes I disagree that os::die() is our normal/regular "fatal error 
> mechanism". That mechanism through VMError::report_and_die() ends up 
> in os::abort, and the flag is used on that path. os::die() is lower 
> level than that.

I think this is sentence that I needed.

>
>>> os::die() means we've badly gone off the rails so we shouldn't pay 
>>> any attention to the CreateCoreDumpOnCrash.
>>
>> Yes, you've said that more than once. So far you haven't said it with
>> any additional information or reasoning to make it anything more than
>> your opinion. Don't forget this was my opinion also when I first started
>> investigating JDK-8188872 and don't forget that I also inject os::die()
>> calls in places where I'm debugging the VM and I want it to die.
>
> Take a look at where we call os:die():
>
> static void UserHandler(int sig, void *siginfo, void *context) {
>  ...
>   // Ctrl-C is pressed during error reporting, likely because the error
>   // handler fails to abort. Let VM die immediately.
>   if (sig == SIGINT && VMError::is_error_reported()) {
>     os::die();
>   }

I hate this one! :-)
Sometimes when I CTRL-C a Java program it causes an error report, but
I've already pressed CTRL-C again and now I get a core file... sigh...

>
> ...
>
> void WatcherThread::run() {
>   assert(this == watcher_thread(), "just checking");
>    ...
>
>     if (VMError::is_error_reported()) {
>       // A fatal error has happened, the error 
> handler(VMError::report_and_die)
>       // should abort JVM after creating an error log file. However in 
> some
>       // rare cases, the error handler itself might deadlock. Here 
> periodically
>       // check for error reporting timeouts, and if it happens, just 
> proceed to
>       // abort the VM.
>       ...
>          err.print_raw_cr("# [ timer expired, abort... ]");
>           // skip atexit/vm_exit/vm_abort hooks
>           os::die();
>
> ---
>
> void VMError::report_and_die(...)
> ...
>      if (recursive_error_count++ > 30) {
>         out.print_raw_cr("[Too many errors, abort]");
>         os::die();
>       }
>
> ...
>     // if os::abort() doesn't abort, try os::die();
>     os::die();
>
> ---
>
> I'd say they are all cases of "going of the rails". The normal fatal 
> error process hasn't worked as expected so we use os::die() to try and 
> force an immediate termination.

Agreed. Thanks for laying out the details like this.

>
>> Let me put it another way:
>>
>> The command line option -XX:-CreateCoreDumpOnCrash has been specified
>> which is a very compelling indication of what the user/test writer 
>> wants.
>> Why should I (the JVM engineer) decide that I know better than what the
>> user/test writer specified?
>
> The flag was not intended to apply to os::die() - yes I only have 
> circumstantial historical evidence to back that up, not explicit 
> documentation. The flag was added to allow tests that intentionally 
> crash the VM to be run in a way that we don't normally get corefiles - 
> as they can take a long time to produce, resulting in timeouts and 
> excessive disk usage.

Okay, I'm convinced.

> And this works perfectly fine for tests that trigger the "normal" 
> fatal error handling response.

Agreed.

>
>>
>>> If you change os::die() for this one test you lose the ability to 
>>> diagnose when things really did go off the rails.
>>
>> That's a strawman argument. If I'm analyzing a test failure for a test
>> that has specified -XX:-CreateCoreDumpOnCrash, I can easily reenable
>> core dumps when I'm trying to repro the issue.
>
> _If_ you can repro it.

Agreed.

> The fact we went off the rails and called os::die() normally means 
> something very unexpected happened. Your chances of repro'ing it are 
> very low IMHO. The core file may be the only thing you have to work on.

Yup. And that core file will only be around if the test was smart enough
to realize that something went wrong with the error that it was looking
for. If the test passed, the artifacts will get cleaned up...

>
>> Also, a test that uses the -XX:-CreateCoreDumpOnCrash option is 
>> typically
>> written to look at stdout, stderr or hs_err_pid for very specific crash
>> information. If the test doesn't crash the way I expect (because things
>> went off the rails), my first move would be to disable 
>> CreateCoreDumpOnCrash,
>> get a core file and see where it really crashed this time.
>
> Again you're assuming you can reproduce something that is not likely 
> to occur very often.

Yup. There's a reason that JDK-8188872 has taken so long. It happens
rarely in my lab. It started happening even more rarely on SPARC
machines in the CI so I continued the hunt in my lab.

>
>>
>>> CreateCoredumpOnCrash (previously known as CreateMinidumpOnCrash and 
>>> used only on windows) is only used for the callers of os::abort to 
>>> decide whether to set "dump_core" to true or false.
>>
>> I'm not sure what point you're trying to make here. I would expect
>> the flag to be used in calls to os::abort() in order to implement
>> the policy specified by the option.
>
> See above.
>
>> I found a total of 32 tests that specify '-XX:-CreateCoredumpOnCrash'
>> so it is not just one test or test writer that is saying: "I don't care
>> about core files." We started adding this option to tests in order to
>> reduce test results disk footprint in those cases where the core file
>> was not considered important.
>
> And most of those tests are only concerned with the first-level "fatal 
> error handler" and everything is fine. But one test (perhaps two?) 
> tries to deal with secondary failures during "fatal error handling" 
> and we even had to introduce a special flag to assist with that. Those 
> tests can more readily end up on a path that calls os::die(). For 
> those specific tests we can augment os::die() to recognize when we're 
> doing that special testing and so change the normal behaviour.

Agreed and that's our way forward.

>
>>
>>>
>>>>
>>>>>>> Maybe there are paths to os::die that ought to be calling os::abort
>>>>>>> instead?
>>>>>> Maybe, but not really the problem that I'm trying to solve today. 
>>>>>> :-)
>>>>> That presumes that the problem is the behavior of os::die rather than
>>>>> who is calling it.  David and I are suggesting the behavior of 
>>>>> os::die
>>>>> is fine, and the reason for the unexpected core dump from the test is
>>>>> that os::die is being called when (perhaps) it shouldn't be. Or maybe
>>>>> the test really is going off into the weeds and a call to os::die is
>>>>> appropriate and the core dump might contain useful information.
>>>>
>>>> I believe in this case, that the author of the step timeout code,
>>>> Thomas Stüfe, is calling os::die() intentionally. As noted above, the
>>>> test is not off in the weeds.
>>>>
>>>> You edited out part of my reply where I said:
>>>>
>>>>> 1) os::die() doesn't say _anything_ about generating a core file:
>>>>>
>>>>>    src/hotspot/share/runtime/os.hpp:
>>>>>
>>>>>      // Die immediately, no exit hook, no abort hook, no cleanup.
>>>>>      static void die(); 
>>>>
>>>> and you didn't address my assertion that there is no contract for
>>>> os::die() to produce a core file. Remember that I'm also a user of
>>>> os::die() when I want a core file... So it surprised me that there's
>>>> no contract...
>>>
>>> Documentation is not exactly a strong point in hotspot code - lack 
>>> of it is hardly a basis for inferring that something need not happen.
>>
>> Similarly, you (and I originally) were making this inference without a
>> basis in documentation:
>>
>>  > os::die() means we've badly gone off the rails so we shouldn't pay 
>> any attention to the CreateCoreDumpOnCrash.
>>
>>
>>> os::die() calls ::abort and we know ::abort does a core dump.
>>
>> Yup. Chasing down that 'proof' ended up being an adventure in
>> man page reading on the different platforms that I have in my
>> lab...
>>
>>
>>> Solaris even spells that out:
>>>
>>> ::abort(); // dump core (for debugging)
>>
>> Yes, Solaris does and I saw that because the JDK-8188872 failure that
>> I'm chasing has only been seen on Solaris. I've thought about whether
>> to add the comment to the other os::die() instances, but we might be
>> moving 'os::die()' to os_posix.cpp and that would solve the issue (if
>> we retain the comment).
>>
>>
>>> So this lack of "contract" is easily fixed.
>>
>> We should fix that in src/hotspot/runtime/os.hpp.
>>
>>
>>> The point is that this last ditch attempt to kill the process should 
>>> not be looking at flags like CreateCoredumpOnCrash.
>>
>> Again, I'm going to return to the fact that the user/test-writer
>> specified '-XX:-CreateCoredumpOnCrash' option on the command line and
>> we (VM engineers) are deciding to ignore that. To me that is not
>> consistent with our general cmd line options policy.
>>
>> Look at this way. We've found an inconsistency in cmd line option 
>> policy.
>> We could "fix" it by this change:
>>
>>    product(bool, CreateCoredumpOnCrash, true,
>>            "Create core/mini dump on VM fatal error. "
>>            "os::die() ignores this option and always creates a 
>> core/mini dump.")
>>
>> That would make it clear to users and test-writers that using
>> -XX:-CreateCoredumpOnCrash might work sometimes. Is that what
>> we want?
>
> os::die() operates at a lower-level to  the regular fatal error 
> handling this flag was intended to interact with.

I think I'll update the comment runtime/os.hpp instead.

>
>>
>>> Nor should its behaviour be dictated by a test!
>>
>> That's a really interesting statement. And I do hope I'm
>> reading the underlying attitude wrong.
>>
>> I see this as a cmd line option consistency issue so the behaviour
>> is dictated by the option and the user or test-writer that specifies
>> the option.
>
> I don't see it as that at all. I see a very low-level last-resort 
> termination function that isn't part of regular abnormal termination 
> mechanism and which therefore isn't expected to interact with flags 
> that affect that mechanism.

Agreed. I'm convinced.

Thanks for continuing the conversation.

Dan

>
> Cheers,
> David
> -----
>
>>
>> This has been a really interesting discussion. I was
>> originally going to keep the fix for this bug bundled up
>> with my fix for JDK-8188872. But it occurred to me that
>> this change really needed to be tracked on its own so
>> that's why I split it off.
>>
>> Thanks for continuing the discussion.
>>
>> Dan
>>
>>
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>>>>>> I think os::abort probably should not be calling ::exit either, but
>>>>>>> should be raising SIGKILL if no core dump is requested. os::exit is
>>>>>>> the path to exit functions and the like.
>>>>>> Agreed (or maybe not). If I'm in os::abort() and I call '::exit()',
>>>>>> am I calling 'os::exit()' or am I calling libc's exit()? I thought
>>>>>> I was calling libc's exit(), but I could be wrong…
>>>>> ::exit() is libc's exit(), which calls atexit handlers.
>>>>>
>>>>> Also, ::_exit() or _Exit() (the latter is in <stdlib.h>) might be
>>>>> alternatives to using SIGKILL.
>>>>
>>>> I'm testing SIGKILL right now... :-)
>>>>
>>>>
>>>>>
>>>>>>> It's unfortunate that the Linux version of os::abort needs to do
>>>>>>> something special for DumpPrivateMappingsInCore; without that, we
>>>>>>> could have a merged posix version of os::abort.
>>>>>> I think that's a fairly recent addition to the Linux version of
>>>>>> os::abort(). It could probably be handled by a posix version of
>>>>>> os::abort() that calls a PD version of os::abort() as needed or
>>>>>> something like that.
>>>>>>
>>>>>> Again, not the problem that I'm trying to solve today.
>>>>> Agreed.
>>>>>
>>>>
>>>> Thanks for continuing the discussion...
>>>>
>>>> Dan
>>>>
>>