RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Sun Feb 1 15:56:08 UTC 2015

Hi Gerard,

thanks for looking into this patch! See comments below.

On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski <gerard.ziemski at oracle.com>
wrote:

> hi Thomas,
>
> #1 The following comment in vmError_bsd.cpp, vmError_linux.cpp and
> vmError_solaris.cpp is now outdated and should be updated from:
>
> // handle all synchronous program error signals which may happen during
> error
> // reporting. They must be unblocked, caught, handled.
> //
> // Note that in hotspot signal handlers are installed with a fully filled
> // signal mask, i.e. upon delivery all signals - also synchronous error
> signals
> // - are blocked. For synchronous error signals this is deadly - program
> will
> // hang or be terminated immediately if secondary errors happen during
> error
> // reporting.
>
> to the comment from vmError_aix.cpp:
>
> // Handle all synchronous signals which may happen during signal handling,
> // not just SIGSEGV and SIGBUS.
>
>
I am not sure I understand: the problem is that signal handlers are
installed in the hotspot with sa_mask completely set, so all signals are
blocked. See e.g. os_linux.cpp, line 4304++
(os::Linux::set_signal_handler()). So the comment appears to be correct.

The AIX version of the comment is a bit different because there we do not
install the original signal handlers with sa_mask fully set.

> #2 Are we sue that "SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGTRAP” are all the
> necessary signals we need? I tried to look on the web for which signals are
> synchronous, but I can’t find anything.

POSIX itself mentioned only the four, see documentation on sigprocmask:

"If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated
while they are blocked, the result is undefined, unless the signal was
generated by the *kill*()
<http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function,
the *sigqueue*()
<http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>
function,
or the *raise*()
<http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
 function."

I added SIGTRAP on AIX because it is synchronously delivered too, but do
not know any other signals. I am happy to add any other signals you know of.

I know you used the code from vmError_aix.cpp, but I thought that maybe we
> need to take the opportunity to review this list?
>

I added the fix originally to our (SAPs) codebase some years ago for HP-UX,
then ported it to the other Unices. At the same time we contributed the AIX
port to the OpenJDK, so that is why in the OpenJDK AIX is the only platform
with parts of the fix. So the list was originally from Linux.

>
> #3 The code:
>
>   STEP(13, "(test secondary crash 1)")
>     if (TestCrashInErrorHandler != 0) {
>       controlled_crash(TestCrashInErrorHandler);
>     }
>
>   STEP(14, "(test secondary crash 2)")
>     if (TestCrashInErrorHandler != 0) {
>       controlled_crash(TestCrashInErrorHandler);
>     }
>
> gets triggered twice (for a total of 4 secondary tests) due to the fact
> that in VMError::report_and_die() we call VMError::report twice, once for
> (!out_done) and once for (!log_done). Is that OK, or do we want it to fix
> it so that it gets triggered only once?
>

I would like to test it at least twice, to see that signal handling
continues to work after secondary handler ran once. It needs to correctly
unset the signal mask for the synchronous signals.

Executing it four times may be a bit excessive; if you prefer, I can remove
it for the out_done case.

Note that there is one benefit of repeatedly crashing in the error handler,
it is to stress the stack usage during error reporting. But I admit that
this test could be done much better.

> #4 There is an existing test using "ErrorHandlerTest" in
> jdk9/hotspot/test/runtime/6888954/vmerrors.sh Should’t we update it to
> include TestCrashInErrorHandler, or better yet create a similar one but for
> "TestCrashInErrorHandler"?
>
>
vmerrors.sh is very minimal. The bug symptom was incomplete or empty hs-err
files, and you would not catch this with the vmerrors.sh test, would you
not? You also can reproduce this bug only by using both
-XX:ErrorHandlerTest and -XX:TestCrashInErrorHandler together, so it does
not quite fit into what vmerrors.sh does.

At SAP we have a nice test suite which crashes the VM in a number of ways
and then checks whether the hs-err files are complete (enough). I was
planning to add a Jtreg test which does something similar, but did not add
it to this change because I did not want to make the change larger than
necessary.

Cheers,

Thomas

> cheers
>
> On 1/26/2015 4:23 AM, Thomas Stüfe wrote:
>
>> Anyone? I would really like to close this issue.
>>
>> Kind Regards,
>>
>> Thomas Stüfe
>>
>> On Wed, Jan 21, 2015 at 10:39 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
>> wrote:
>>
>>  Hi all,
>>>
>>> I would like to take up discussion about this issue again.
>>>
>>> As a reminder, here the bug report:
>>> https://bugs.openjdk.java.net/browse/JDK-8065895
>>>
>>> And here a new version of my patch:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.05/webrev/
>>>
>>> As far as I remember the discussion, the issue itself was understood and
>>> the fix itself met with approval (I think), but the point of contention
>>> was
>>> the code I added to reproduce the error for regression tests.
>>>
>>> In order to reproduce the error, I need two different synchronous signals
>>> to happen, one in normal code, one in the error handler which writes the
>>> hs-err file. Originally I choose SIGSEGV and SIGILL. I added functions to
>>> generate (true, real) SIGSEGVs and SIGILLs. But nobody liked my
>>> generate-sigill function, therefore I changed the code to generate a
>>> SIGFPE
>>> instead. For the test, it does not matter which signals are generated as
>>> long as they are synchronous (so, one of SEGV,ILL,BUS,FPE).
>>>
>>> In order to check that the fix works, one can do:
>>>
>>> java -XX:ErrorHandlerTest=15 -XX:TestCrashInErrorHandler=14
>>>
>>> The VM will first crash with a SIGSEGV, enter error handling, crash again
>>> with a different synchronous signal (FPE, in this case). An unfixed VM
>>> will
>>> die immediately and we get a torn hs-err file. A fixed VM will show the
>>> "error occurred during error handling" string and continue with the error
>>> reporting.
>>>
>>> Kind Regards,
>>>
>>> Thomas Stüfe
>>>
>>>
>>>
>>> On Mon, Dec 8, 2014 at 11:37 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
>>> wrote:
>>>
>>>  Hi David, Dean,
>>>>
>>>> On Fri, Dec 5, 2014 at 8:05 AM, David Holmes <david.holmes at oracle.com>
>>>> wrote:
>>>>
>>>>  On 3/12/2014 8:47 PM, Thomas Stüfe wrote:
>>>>>
>>>>>  Hi Dean,
>>>>>>
>>>>>> I dont understand. Such a function does not exist, does it? So I would
>>>>>> have to write it:
>>>>>>
>>>>>> Do you mean generating and using a StubRoutine which would SIGILL? I
>>>>>> did
>>>>>> not do this because I wanted to be able to generate SIGILL also in
>>>>>> initialization code, where StubRoutines may not yet be generated. This
>>>>>> point may may be arguable, but as this function is used to test error
>>>>>> handling, it may be interesting to test it for half-initialized VMs
>>>>>> too.
>>>>>>
>>>>>> Otherwise I would implement the CPU specific
>>>>>> generate_illegal_instruction___sequence() probably the same way as I
>>>>>> do
>>>>>> now the crash_with_sigill() function. That would mean a bit of more
>>>>>> code
>>>>>> duplication because:
>>>>>> - Either I use the method I use now (reserve_memory and copy the
>>>>>> instructions to the reserved page)
>>>>>> - Or I use inline assembly - which probably does not work across
>>>>>> multiple OSs, so for CPUs which span various OSs I would have to add
>>>>>> one
>>>>>> function per os_cpu combination, not just per cpu.
>>>>>>
>>>>>>  I don't think there is any OS dependency with inline assembly - only
>>>>> compiler. And I am also concerned that writing code to an executable
>>>>> page
>>>>> will also enter the realm of "self-modifying code" and all the jumping
>>>>> through hoops that entails. That aspect hadn't occurred to me till Dean
>>>>> raised it. I'm forming the view that triggering a SIGILL is more effort
>>>>> than it is worth for a secondary testing function.
>>>>>
>>>>>
>>>>>  Well, the code is used and works in our VM since some years on a
>>>> number
>>>> of CPUs, so the problem with the flushing do not occur at least in our
>>>> cases. But I agree with you, and this seems to be a point of contention
>>>> and
>>>> it is really too unimportant to stop the whole patch.
>>>>
>>>> The whole point of using SIGILL was to have another unblockable signal
>>>> besides SIGSEGV to occur naturally (without raising) to be able to
>>>> demonstrate the bug before fixing it. I will now attempt to change the
>>>> patch to use either SIGFPE or SIGBUS as a secondary signal. Maybe
>>>> generating those signals with pure C/C++ is easier. If that does not
>>>> work
>>>> out, I will see what I can do with raise().
>>>>
>>>> Thanks and Kind regards, Thomas
>>>>
>>>>
>>>>
>>>>
>