RFR(XXXS): 8205648 fix for 8205195 breaks secondary error handling

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Jun 26 18:01:52 UTC 2018


Thomas,

Thanks for the review. Replies embedded below...


On 6/26/18 1:06 PM, Thomas Stüfe wrote:
> Hi Daniel,
>
> test if I understand this correctly:
>
> The original bug: we execute
> NestedThreadsListHandleInErrorHandlingTest, which is supposed to
> controlled-crash the VM and find SMR related info in hs-err file. But
> during error reporting, ThreadsSMRSupport::print_info_on() crashes,
> since internal SMR data structures are modified concurrently and
> ThreadsSMRSupport::print_info_on() does not lock. The solution was to
> unconditionally, in VMError::controlled_crash(), grab the
> Threads_lock.
>
> David spotted one path where this may be called recursively: when we
> test secondary signal handling, we call VMError::controlled_crash()
> from within the error handler. In that case, the second call to
> VMError::controlled_crash() will now, instead of crashing, trigger an
> assert, since we already own Threads_lock.
>
> This did not trip up SecondaryErrorTest.java since it was supposed to
> crash. Now it crashes with an assert, not a secondary signal. Which
> quietly castrates the test, btw - the reason we contributed this as a
> regression for running into blocked signals during signal handling,
> which is a bad thing. Converting this second signal to assert makes
> the test ineffective.
>
> So far correct?

Exactly right. :-)


> ---
>
> Yes, the secondary signal test is too coarse. In our port, we have
> embellished the "Error occurred during error reporting" message and
> print out the signal number and pc. But we have not yet brought that
> improvement upstream.

I thought about making the "reason" for the secondary crash available
via the hs_err_pid file, but that is too big of a change for this.


> I think your patch is okay.

Thanks!


> But now I wonder if it would be cleaner to try-lock-if-not-owned
> inside ThreadsSMRSupport::print_info_on() ? Or its caller,
> Thread::print_on_error()? Or above that, around that call in that one
> STEP inside VMError::report() ?

I considered that during the work on JDK-8205195, but that just seems
too risky. Hence the big comment:

1070   // Only grab the Threads_lock if we don't already own it and if we
1071   // are not reporting an error.
1072   // Note: Not grabbing the Threads_lock during during error reporting
1073   // is dangerous because the data structures we want to print can be
1074   // freed concurrently. However, grabbing the Threads_lock during
1075   // error reporting can be equally dangerous since this thread might
1076   // block during error reporting or a nested error could leave the
1077   // Threads_lock held. The classic no win scenario.

So I could do a try-lock-if-not-owned, but then I run into the risk of
leaving the Threads_lock held if something else fails in that code path
and we end up throwing another signal... More Kobiyashi Maru [1]

Again, thanks for the review!

Dan

[1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru


>
> Thanks, Thomas
>
>
> On Tue, Jun 26, 2018 at 6:17 PM, Daniel D. Daugherty
> <daniel.daugherty at oracle.com> wrote:
>> Thanks for the quick review Serguei!
>>
>> One more review would be nice... David H. or Thomas Stüfe?
>>
>>
>> On 6/26/18 12:12 PM, serguei.spitsyn at oracle.com wrote:
>>> Hi Dan,
>>>
>>> The fix looks reasonable to me.
>>
>> Thanks. I like simple one liners... :-)
>>
>>
>>> Nice catch by David!
>>
>> Definitely.
>>
>> Dan
>>
>>
>>
>>> Thanks,
>>> Serguei
>>>
>>>
>>> On 6/26/18 06:06, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> For this one liner review, I'd love to hear from David H., Thomas Stüfe,
>>>> and Serguei Spitsyn.
>>>>
>>>> David Holmes spotted a locking problem with my fix for the following bug:
>>>>
>>>>      JDK-8205195 NestedThreadsListHandleInErrorHandlingTest fails because
>>>>                  hs_err doesn't contain _nested_thread_list_max
>>>>      https://bugs.openjdk.java.net/browse/JDK-8205195
>>>>
>>>>      Webrev URL:
>>>> http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>>>
>>>> The fix for the issue is a one liner that only grabs the Threads_lock
>>>> when the caller does not already own it:
>>>>
>>>> $ hg diff
>>>> diff -r 5698cf4e50f1 src/hotspot/share/utilities/vmError.cpp
>>>> --- a/src/hotspot/share/utilities/vmError.cpp Fri Jun 22 12:15:16 2018
>>>> -0400
>>>> +++ b/src/hotspot/share/utilities/vmError.cpp Mon Jun 25 21:20:52 2018
>>>> -0400
>>>> @@ -1703,7 +1703,7 @@
>>>>     // from racing with Threads::add() or Threads::remove() as we
>>>>     // generate the hs_err_pid file. This makes our ErrorHandling tests
>>>>     // more stable.
>>>> - MutexLockerEx ml(Threads_lock, Mutex::_no_safepoint_check_flag);
>>>> + MutexLockerEx ml(Threads_lock->owned_by_self() ? NULL : Threads_lock,
>>>> Mutex::_no_safepoint_check_flag);
>>>>
>>>>     switch (how) {
>>>>       case 1: vmassert(str == NULL, "expected null"); break;
>>>>
>>>>
>>>> Figuring out a way to detect the failure mode in order to test the
>>>> fix was much more involved than the fix itself. Gory details are
>>>> in the bug:
>>>>
>>>>      JDK-8205648 fix for 8205195 breaks secondary error handling
>>>>      https://bugs.openjdk.java.net/browse/JDK-8205648
>>>>
>>>> No webrev since the one liner diff is shown above. This fix was tested
>>>> with the test/hotspot/jtreg/runtime/ErrorHandling tests on Linux-X64
>>>> and with a special version of ErrorHandling/SecondaryErrorTest.java
>>>> that is documented in JDK-8205648.
>>>>
>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>
>>>> Dan
>>>



More information about the serviceability-dev mailing list