RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Gerard Ziemski gerard.ziemski at oracle.com
Tue Feb 3 21:15:19 UTC 2015


Looks good! Thank you for a great fix!

cheers

On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
> Hi Gerard,
>
> thank you for your feedback! See here my new change: 
> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.07/webrev/ 
> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8065895/webrev.07/webrev/>
>
> On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski 
> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>
>     hi Thomas,
>
>     On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>>     On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski
>>     <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>>
>>         hi Thomas,
>>
>>         #1 The following comment in vmError_bsd.cpp,
>>         vmError_linux.cpp and vmError_solaris.cpp is now outdated and
>>         should be updated from:
>>
>>         // handle all synchronous program error signals which may
>>         happen during error
>>         // reporting. They must be unblocked, caught, handled.
>>         //
>>         // Note that in hotspot signal handlers are installed with a
>>         fully filled
>>         // signal mask, i.e. upon delivery all signals - also
>>         synchronous error signals
>>         // - are blocked. For synchronous error signals this is
>>         deadly - program will
>>         // hang or be terminated immediately if secondary errors
>>         happen during error
>>         // reporting.
>>
>>         to the comment from vmError_aix.cpp:
>>
>>         // Handle all synchronous signals which may happen during
>>         signal handling,
>>         // not just SIGSEGV and SIGBUS.
>>
>>
>>     I am not sure I understand: the problem is that signal handlers
>>     are installed in the hotspot with sa_mask completely set, so all
>>     signals are blocked. See e.g. os_linux.cpp, line 4304++
>>     (os::Linux::set_signal_handler()). So the comment appears to be
>>     correct.
>     We continue to block the signals initially, but the comment, or at
>     least its last sentence, does not apply anymore because of the
>     crux of your fix - ie. the program will not hang because we
>     immediately unblock the synchronous traps in the crash handler. So
>     maybe we should make that clear in that comment?
>
>
> Okay, I see that now. I modified the comments as you suggested earlier.
>
>>     POSIX itself mentioned only the four, see documentation on
>>     sigprocmask:
>>
>>     "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are
>>     generated while they are blocked, the result is undefined, unless
>>     the signal was generated by the /kill/()
>>     <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function,
>>     the /sigqueue/()
>>     <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html> function,
>>     or the /raise/()
>>     <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html> function."
>>
>>     I added SIGTRAP on AIX because it is synchronously delivered too,
>>     but do not know any other signals. I am happy to add any other
>>     signals you know of.
>>
>     Nope, I do not know any more synchronous signals - just wanted to
>     make sure we take a second look and collectively agree that this
>     list is exhaustive.
>
>>     I would like to test it at least twice, to see that signal
>>     handling continues to work after secondary handler ran once. It
>>     needs to correctly unset the signal mask for the synchronous
>>     signals.
>>
>>     Executing it four times may be a bit excessive; if you prefer, I
>>     can remove it for the out_done case.
>>
>>     Note that there is one benefit of repeatedly crashing in the
>>     error handler, it is to stress the stack usage during error
>>     reporting. But I admit that this test could be done much better.
>>
>     It's a bit confusing at first when someone debugs this code and
>     sees four of those secondary signals coming through, so maybe just
>     a small comment here would suffice?
>
>
> I changed the coding to only fire for the log case. That should be 
> less confusing.
>
>>     vmerrors.sh is very minimal. The bug symptom was incomplete or
>>     empty hs-err files, and you would not catch this with the
>>     vmerrors.sh test, would you not? You also can reproduce this bug
>>     only by using both -XX:ErrorHandlerTest and
>>     -XX:TestCrashInErrorHandler together, so it does not quite fit
>>     into what vmerrors.sh does.
>>
>>     At SAP we have a nice test suite which crashes the VM in a number
>>     of ways and then checks whether the hs-err files are complete
>>     (enough). I was planning to add a Jtreg test which does something
>>     similar, but did not add it to this change because I did not want
>>     to make the change larger than necessary.
>     I would think that the added complexity of the test case, would in
>     fact make it easier to review the test with the fix at the same
>     time. Also, since the test in this case is feasible, isn't it
>     recommended to have the fix committed along with its test?
>
>
> I added a small jtreg test. Note that the test cannot really 
> demonstrate the bug without my fix, because it needs the new 
> -XX:TestCrashInErrorHandler options which is part of the fix. But it 
> will show regressions.
>
>     cheers
>
>
> Best Regards, Thomas
>



More information about the hotspot-runtime-dev mailing list