RFR(XXXS): 8205648 fix for 8205195 breaks secondary error handling

Tue Jun 26 18:02:52 UTC 2018

On 6/26/18 1:09 PM, Thomas Stüfe wrote:
> On Tue, Jun 26, 2018 at 7:06 PM, Thomas Stüfe <thomas.stuefe at gmail.com> wrote:
>> Hi Daniel,
>>
>> test if I understand this correctly:
>>
>> The original bug: we execute
>> NestedThreadsListHandleInErrorHandlingTest, which is supposed to
>> controlled-crash the VM and find SMR related info in hs-err file. But
>> during error reporting, ThreadsSMRSupport::print_info_on() crashes,
>> since internal SMR data structures are modified concurrently and
>> ThreadsSMRSupport::print_info_on() does not lock. The solution was to
>> unconditionally, in VMError::controlled_crash(), grab the
>> Threads_lock.
>>
>> David spotted one path where this may be called recursively: when we
>> test secondary signal handling, we call VMError::controlled_crash()
>> from within the error handler. In that case, the second call to
>> VMError::controlled_crash() will now, instead of crashing, trigger an
>> assert, since we already own Threads_lock.
>>
>> This did not trip up SecondaryErrorTest.java since it was supposed to
>> crash. Now it crashes with an assert, not a secondary signal. Which
>> quietly castrates the test, btw - the reason we contributed this as a
>> regression for running into blocked signals during signal handling,
>> which is a bad thing. Converting this second signal to assert makes
>> the test ineffective.
>>
>> So far correct?
>>
>> ---
>>
>> Yes, the secondary signal test is too coarse. In our port, we have
>> embellished the "Error occurred during error reporting" message and
>> print out the signal number and pc. But we have not yet brought that
>> improvement upstream.
>>
>> I think your patch is okay.
>>
>> But now I wonder if it would be cleaner to try-lock-if-not-owned
>> inside ThreadsSMRSupport::print_info_on() ? Or its caller,
>> Thread::print_on_error()? Or above that, around that call in that one
>> STEP inside VMError::report() ?
>>
> just to be clear, I am fine with the patch in its current form, if you
> want to ship that.

Yup. I got that based on the "I think your patch is okay." above...

Dan

>
> ..Thomas
>
>> Thanks, Thomas
>>
>>
>> On Tue, Jun 26, 2018 at 6:17 PM, Daniel D. Daugherty
>> <daniel.daugherty at oracle.com> wrote:
>>> Thanks for the quick review Serguei!
>>>
>>> One more review would be nice... David H. or Thomas Stüfe?
>>>
>>>
>>> On 6/26/18 12:12 PM, serguei.spitsyn at oracle.com wrote:
>>>> Hi Dan,
>>>>
>>>> The fix looks reasonable to me.
>>>
>>> Thanks. I like simple one liners... :-)
>>>
>>>
>>>> Nice catch by David!
>>>
>>> Definitely.
>>>
>>> Dan
>>>
>>>
>>>
>>>> Thanks,
>>>> Serguei
>>>>
>>>>
>>>> On 6/26/18 06:06, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> For this one liner review, I'd love to hear from David H., Thomas Stüfe,
>>>>> and Serguei Spitsyn.
>>>>>
>>>>> David Holmes spotted a locking problem with my fix for the following bug:
>>>>>
>>>>>      JDK-8205195 NestedThreadsListHandleInErrorHandlingTest fails because
>>>>>                  hs_err doesn't contain _nested_thread_list_max
>>>>>      https://bugs.openjdk.java.net/browse/JDK-8205195
>>>>>
>>>>>      Webrev URL:
>>>>> http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>>>>
>>>>> The fix for the issue is a one liner that only grabs the Threads_lock
>>>>> when the caller does not already own it:
>>>>>
>>>>> $ hg diff
>>>>> diff -r 5698cf4e50f1 src/hotspot/share/utilities/vmError.cpp
>>>>> --- a/src/hotspot/share/utilities/vmError.cpp Fri Jun 22 12:15:16 2018
>>>>> -0400
>>>>> +++ b/src/hotspot/share/utilities/vmError.cpp Mon Jun 25 21:20:52 2018
>>>>> -0400
>>>>> @@ -1703,7 +1703,7 @@
>>>>>     // from racing with Threads::add() or Threads::remove() as we
>>>>>     // generate the hs_err_pid file. This makes our ErrorHandling tests
>>>>>     // more stable.
>>>>> - MutexLockerEx ml(Threads_lock, Mutex::_no_safepoint_check_flag);
>>>>> + MutexLockerEx ml(Threads_lock->owned_by_self() ? NULL : Threads_lock,
>>>>> Mutex::_no_safepoint_check_flag);
>>>>>
>>>>>     switch (how) {
>>>>>       case 1: vmassert(str == NULL, "expected null"); break;
>>>>>
>>>>>
>>>>> Figuring out a way to detect the failure mode in order to test the
>>>>> fix was much more involved than the fix itself. Gory details are
>>>>> in the bug:
>>>>>
>>>>>      JDK-8205648 fix for 8205195 breaks secondary error handling
>>>>>      https://bugs.openjdk.java.net/browse/JDK-8205648
>>>>>
>>>>> No webrev since the one liner diff is shown above. This fix was tested
>>>>> with the test/hotspot/jtreg/runtime/ErrorHandling tests on Linux-X64
>>>>> and with a special version of ErrorHandling/SecondaryErrorTest.java
>>>>> that is documented in JDK-8205648.
>>>>>
>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>
>>>>> Dan
>>>>