RFR(XXS): 8224793: os::die() does not honor CreateCoredumpOnCrash option

Thu May 30 03:05:13 UTC 2019

Hi Dan,

On 30/05/2019 3:50 am, Daniel D. Daugherty wrote:
> On 5/29/19 1:19 PM, Kim Barrett wrote:
>>> On May 29, 2019, at 10:27 AM, Daniel D. Daugherty 
>>> <daniel.daugherty at oracle.com> wrote:
>>> On 5/28/19 9:03 PM, Kim Barrett wrote:
>>>> And looking at this some more, I think I agree with David's comment in
>>>> the CR, that os::die *should* always dump core.  It gets called when
>>>> things have gone horribly wrong, and a core dump might be the only way
>>>> to understand what went wrong.
>>> I understand where you and David are coming from. […]
>>>
>>> To run a test with '-XX:-CreateCoredumpOnCrash' and then see a core
>>> file in the execution directory violates the principle of least
>>> astonishment. From the test author's POV, they did what they are
>>> supposed to do (passing the '-XX:-CreateCoredumpOnCrash' option)
>>> and the result is unexpected.
>> The position that David and I are taking is that the unexpected core
>> dump is not a problem, and is indeed a good thing, because the test
>> apparently ran off into the weeds.
> 
> The test in JDK-8188872 didn't run off into the weeds. It is testing
> exactly what it is intending to test:
> 
>    - If an error handling step times out, we get a message to tell us
>      that the step timed out. We make sure we get at least two of those.
>    - If error handling reaches the global timeout, then the VM should
>      have been aborted by the WatcherThread. That's the intentional
>      os::die() call...
> 
> and I'm asserting that the test author knows what he/she is doing when
> they specify the '-XX:-CreateCoredumpOnCrash' option. Please explain
> why you think it is a good thing to ignore what the test writer asked
> for and dump core when '-XX:-CreateCoredumpOnCrash' option is specified?

Because we should not be changing the behaviour of os::die() because of 
what one very specific test that intends to end up calling die() 
desires. os::die() means we've badly gone off the rails so we shouldn't 
pay any attention to the CreateCoreDumpOnCrash. If you change os::die() 
for this one test you lose the ability to diagnose when things really 
did go off the rails.

CreateCoredumpOnCrash (previously known as CreateMinidumpOnCrash and 
used only on windows) is only used for the callers of os::abort to 
decide whether to set "dump_core" to true or false.

> 
>>>> Maybe there are paths to os::die that ought to be calling os::abort
>>>> instead?
>>> Maybe, but not really the problem that I'm trying to solve today. :-)
>> That presumes that the problem is the behavior of os::die rather than
>> who is calling it.  David and I are suggesting the behavior of os::die
>> is fine, and the reason for the unexpected core dump from the test is
>> that os::die is being called when (perhaps) it shouldn't be.  Or maybe
>> the test really is going off into the weeds and a call to os::die is
>> appropriate and the core dump might contain useful information.
> 
> I believe in this case, that the author of the step timeout code,
> Thomas Stüfe, is calling os::die() intentionally. As noted above, the
> test is not off in the weeds.
> 
> You edited out part of my reply where I said:
> 
>> 1) os::die() doesn't say _anything_ about generating a core file:
>>
>>    src/hotspot/share/runtime/os.hpp:
>>
>>      // Die immediately, no exit hook, no abort hook, no cleanup.
>>      static void die(); 
> 
> and you didn't address my assertion that there is no contract for
> os::die() to produce a core file. Remember that I'm also a user of
> os::die() when I want a core file... So it surprised me that there's
> no contract...

Documentation is not exactly a strong point in hotspot code - lack of it 
is hardly a basis for inferring that something need not happen. 
os::die() calls ::abort and we know ::abort does a core dump. Solaris 
even spells that out:

::abort(); // dump core (for debugging)

So this lack of "contract" is easily fixed.

The point is that this last ditch attempt to kill the process should not 
be looking at flags like CreateCoredumpOnCrash. Nor should its behaviour 
be dictated by a test!

Thanks,
David
-----

>>>> I think os::abort probably should not be calling ::exit either, but
>>>> should be raising SIGKILL if no core dump is requested.  os::exit is
>>>> the path to exit functions and the like.
>>> Agreed (or maybe not). If I'm in os::abort() and I call '::exit()',
>>> am I calling 'os::exit()' or am I calling libc's exit()? I thought
>>> I was calling libc's exit(), but I could be wrong…
>> ::exit() is libc's exit(), which calls atexit handlers.
>>
>> Also, ::_exit() or _Exit() (the latter is in <stdlib.h>) might be
>> alternatives to using SIGKILL.
> 
> I'm testing SIGKILL right now... :-)
> 
> 
>>
>>>> It's unfortunate that the Linux version of os::abort needs to do
>>>> something special for DumpPrivateMappingsInCore; without that, we
>>>> could have a merged posix version of os::abort.
>>> I think that's a fairly recent addition to the Linux version of
>>> os::abort(). It could probably be handled by a posix version of
>>> os::abort() that calls a PD version of os::abort() as needed or
>>> something like that.
>>>
>>> Again, not the problem that I'm trying to solve today.
>> Agreed.
>>
> 
> Thanks for continuing the discussion...
> 
> Dan
>