RFR (S): 8007779: os::die() on solaris should generate core file

Tue Feb 12 02:22:52 PST 2013

Did we reach a conclusion on this change? Was it ok to call abort() so that we get core files and can debug these crashes?

Thanks,
/Staffan

On 8 feb 2013, at 19:21, Dean Long <dean.long at oracle.com> wrote:

> The 0x80 bit should be part of the value returned by wait(2), but it may be shell-specific which
> bits are captured by $?.
> 
> dl
> 
> On 2/8/2013 8:04 AM, Mikael Vidstedt wrote:
>> On 2013-02-08 07:12, Mikael Gerdin wrote:
>>> On 2013-02-08 15:05, Staffan Larsen wrote:
>>>> After an experiment, the return code from the process seems to be 134. This is the same code we get after hs_err is printed successfully and we do manage to create a core dump.
>>> 
>>> When a POSIX process is terminated by an uncaught fatal signal, the exit code is usually 128 + SIGNAL.
>>> Since SIGABRT == 6, you got 134.
>> 
>> I believe the 128+n convention may be specific to bash rather than common to POSIX processes in general, but the same conclusion holds.
>> 
>> /Another Mikael
>> 
>>> 
>>> /Mikael
>>> 
>>>> 
>>>> /Staffan
>>>> 
>>>> On 8 feb 2013, at 14:54, David Holmes <david.holmes at oracle.com> wrote:
>>>> 
>>>>> My other email hasn't turned up yet but I was confusing this with the change that added the dump_core flag to os::abort.
>>>>> 
>>>>> It's only by "accident" that we use ::abort on linux - _exit didn't work back in the old days of LinuxThreads :)
>>>>> 
>>>>> This seems like a simple and potentially useful change, but I have a feeling it may have some unexpected consequences somewhere. :)
>>>>> 
>>>>> Actually, one possible consequence: what return code will the process issue if it now hits this? Could this impact testing and failure matching?
>>>>> 
>>>>> David
>>>>> 
>>>>> On 8/02/2013 10:24 PM, Staffan Larsen wrote:
>>>>>> This is a request for review of a small change to the crash reporting on Solaris.
>>>>>> 
>>>>>> When HotSpot crashes during the writing of the hs_err file, we call os::die(). On Linux and BSD this causes a core file to be written (by calling ::abort()). This is good since we then have some record of what went wrong. On Solaris, we call _exit() and no core file is created.
>>>>>> 
>>>>>> There are two cases during the hs_err writing where we call os::die(). First, if the writing hangs, the WatcherThread will call os::die(). Second, if we get too many errors during the writing we will call os::die(). In both these cases it would be very helpful to have a core file. Otherwise all you have to go on is something like this:
>>>>>> 
>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>> #
>>>>>> # SIGSEGV (0xb) at pc=0xffffffff653848c0, pid=11823, tid=240
>>>>>> #
>>>>>> # JRE version: Java(TM) SE Runtime Environment (7.0_12-b11)
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.0-b24 mixed mode solaris-sparc compressed oops)
>>>>>> # Problematic frame:
>>>>>> # C [libc.so.1+0x848c0]# [ timer expired, abort... ]
>>>>>> 
>>>>>> Below is the change I would like to make.
>>>>>> 
>>>>>> Thanks,
>>>>>> /Staffan
>>>>>> 
>>>>>> 
>>>>>> diff --git a/src/os/solaris/vm/os_solaris.cpp b/src/os/solaris/vm/os_solaris.cpp
>>>>>> --- a/src/os/solaris/vm/os_solaris.cpp
>>>>>> +++ b/src/os/solaris/vm/os_solaris.cpp
>>>>>> @@ -1865,7 +1865,7 @@
>>>>>> 
>>>>>>  // Die immediately, no exit hook, no abort hook, no cleanup.
>>>>>>  void os::die() {
>>>>>> -  _exit(-1);
>>>>>> +  ::abort(); // dump core (for debugging)
>>>>>>  }
>>>>>> 
>>>>>> 
>>>> 
>>
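
A minimal sketch for checking the exit-status points quoted above (134 = 128 + SIGABRT, and the core flag in the wait(2) status), assuming a POSIX system with fork() and waitpid(). The helper names (die_with_exit, die_with_abort, status_after, shell_style_code, core_flag_set) are hypothetical and not part of HotSpot. WCOREDUMP is not required by POSIX, so it is guarded.

```cpp
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <csignal>
#include <cstdlib>

// Hypothetical stand-ins for the two os::die() variants in the diff.
static void die_with_exit()  { _exit(-1); }   // current Solaris behaviour
static void die_with_abort() { ::abort(); }   // proposed behaviour

// Runs die_fn in a forked child and returns the raw wait status,
// or -1 if fork() or waitpid() failed. When allow_core is false the
// child sets RLIMIT_CORE to 0 to discourage writing a core file.
static int status_after(void (*die_fn)(), bool allow_core) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    if (!allow_core) {
      struct rlimit no_core = {0, 0};
      setrlimit(RLIMIT_CORE, &no_core);
    }
    die_fn();
    _exit(127);  // only reached if die_fn returns
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return status;
}

// The value bash would report in $? for this wait status
// (128 + signal number for a signal death; other shells may differ).
static int shell_style_code(int status) {
  if (WIFSIGNALED(status)) return 128 + WTERMSIG(status);
  if (WIFEXITED(status))   return WEXITSTATUS(status);
  return -1;
}

// True if the wait status says the child was killed by a signal and
// the system reports that a core image was produced.
static bool core_flag_set(int status) {
#ifdef WCOREDUMP
  return WIFSIGNALED(status) && WCOREDUMP(status);
#else
  (void)status;
  return false;
#endif
}
```

Calling shell_style_code(status_after(die_with_abort, false)) should give 128 + SIGABRT, while the _exit(-1) variant is reported as a normal exit with status 255. Whether core_flag_set() returns true for the abort case with allow_core set depends on the core size limit and the system's core file configuration.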