RFR(S): 8200374: Add ThreadsSMRSupport::verify_hazard_pointer_scanned() to verify threads_do()

Daniel D. Daugherty daniel.daugherty at oracle.com
Wed Apr 4 13:07:02 UTC 2018


On 4/3/18 10:30 PM, David Holmes wrote:
> On 4/04/2018 11:46 AM, Daniel D. Daugherty wrote:
>> On 4/3/18 5:52 PM, Daniel D. Daugherty wrote:
>>> The actual relative diff:
>>>
>>> $ hg diff
>>> diff -r b1cf49ff8e1c src/hotspot/share/trace/tracingExport.cpp
>>> --- a/src/hotspot/share/trace/tracingExport.cpp    Tue Apr 03 
>>> 16:57:57 2018 -0400
>>> +++ b/src/hotspot/share/trace/tracingExport.cpp    Tue Apr 03 
>>> 17:50:26 2018 -0400
>>> @@ -36,5 +36,6 @@
>>>  void TracingExport::set_sampler_thread_with_lock(Thread* thread) {
>>>    // Grab Threads_lock to avoid conflicts with Thread-SMR scans.
>>>    MutexLocker ml(Threads_lock);
>>> -  _sampler_thread = thread;
>>> +  // To match with the lock-free load_acquire():
>>> +  OrderAccess::release_store(&_sampler_thread, thread);
>>>  }
>>>
>>>
>>> The annoying thing is that my plan was to switch from
>>> OrderAccess::release_store_fence() to OrderAccess::release_store()
>>> for this round of testing AND I talked myself into dropping the
>>> release_store() entirely...
>>
>> Sigh... going to OrderAccess::release_store() has brought back
>> a sampler thread race crash that stopped when I switched to
>> OrderAccess::release_store_fence() from the straight up:
>>
>>      _sampler_thread = thread;
>>
>> that I had in CR0.
>>
>> So it looks like I need the "release" to match up to the
>> "acquire" (thanks for reminding me David) AND I also need
>> the fence.
>
> That doesn't sound right because the unlock should be giving you a 
> fence anyway.

One would think so... unless I've run into another one of Dave's
optimizations where it is "safe" to elide the fence. I know I've
run into this before... so I'll take a look at the Mutex code
(again)...

In any case, after switching back to release_store_fence(), I have
not seen a failure... yet... Mach5 is done, but my Solaris-X64
Thread-SMR stress run is still going...

Dan


>
> David
>
>> Don't know if Mach5 is still rocking, but I'm about to find out...
>>
>> Dan
>>
>>
>>>
>>> Good thing Mach5 is rocking today!
>>>
>>> Dan
>>>
>>>
>>> On 4/3/18 5:45 PM, Daniel D. Daugherty wrote:
>>>> On 4/3/18 5:39 PM, David Holmes wrote:
>>>>> On 4/04/2018 7:25 AM, Daniel D. Daugherty wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> This fix has been revised due to feedback from David H. and Erik O.
>>>>>>
>>>>>> Here's the incremental webrev:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~dcubed/8200374-webrev/2_for_jdk_hs_open.inc/ 
>>>>>>
>>>>>>
>>>>>> And here's the full webrev:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~dcubed/8200374-webrev/2_for_jdk_hs_open.full/ 
>>>>>>
>>>>>>
>>>>>> Summary of the changes relative to 'cr1':
>>>>>>
>>>>>> - Rename TracingExport::get_sampler_thread() to 
>>>>>> sampler_thread_acquire()
>>>>>>    and TracingExport::set_sampler_thread() to 
>>>>>> set_sampler_thread_with_lock().
>>>>>> - set_sampler_thread_with_lock() uses Threads_lock to avoid 
>>>>>> conflicts with
>>>>>>    Thread-SMR scans.
>>>>>
>>>>> Sorry Dan but this is now mixing lock-free and locked in a 
>>>>> "broken" way.
>>>>
>>>> Yup... Little voice in the back of my head was telling me
>>>> that... I didn't listen... I'll switch it to:
>>>>
>>>> OrderAccess::release_store(&_sampler_thread, thread);
>>>>
>>>> and retest... :-)
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>>> A load_acquire is supposed to always be there to match a 
>>>>> release_store - but we don't have a release_store because we hold 
>>>>> the lock when setting the field. The lock is being used to 
>>>>> serialize with the ThreadSMR scans. If we need an acquire it is 
>>>>> because we want to ensure that we see stores prior to the setting 
>>>>> of the sampler_thread when we read the sampler_thread. If that is 
>>>>> the case then we still need the release_store because we may read 
>>>>> the new sampler_thread value as soon as it is set, but before the 
>>>>> unlock occurs - otherwise prior stores need not be visible. If it 
>>>>> is not the case that we need to see prior stores then we don't 
>>>>> need the acquire.
>>>>>
>>>>> David
>>>>> -----
>>>>>
>>>>>> The Mach5 builds-tier1, jdk-tier[1-3], and hs-tier[1-3] run has 
>>>>>> finished
>>>>>> and there are no test failures related to these changes.
>>>>>>
>>>>>> My Solaris-X64 server is finishing the third round of Thread-SMR 
>>>>>> stress
>>>>>> testing on the 'cr1' version. I'll start a round of stress 
>>>>>> testing for
>>>>>> this version ('cr2') as soon as I can.
>>>>>>
>>>>>> Thanks, in advance, for any comments, suggestions, or feedback.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>> On 4/2/18 11:11 AM, Daniel D. Daugherty wrote:
>>>>>>> Greetings,
>>>>>>>
>>>>>>> This fix has been revised due to additional testing and due to
>>>>>>> feedback from David H.
>>>>>>>
>>>>>>> Here's the incremental webrev:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~dcubed/8200374-webrev/1_for_jdk_hs_open.inc/ 
>>>>>>>
>>>>>>>
>>>>>>> And here's the full webrev:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~dcubed/8200374-webrev/1_for_jdk_hs_open.full/ 
>>>>>>>
>>>>>>>
>>>>>>> My usual Thread-SMR stress testing on my Solaris-X64 is still 
>>>>>>> running;
>>>>>>> so far there has been only one unrelated intermittent test 
>>>>>>> failure. The
>>>>>>> Mach5 builds-tier1, jdk-tier[1-3], and hs-tier[1-3] run is still in
>>>>>>> process...
>>>>>>>
>>>>>>> Thanks, in advance, for any comments, suggestions, or feedback.
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>> On 3/28/18 9:52 PM, Daniel D. Daugherty wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> I have a (mostly) small enhancement for Thread-SMR:
>>>>>>>>
>>>>>>>> JDK-8200374 Add 
>>>>>>>> ThreadsSMRSupport::verify_hazard_pointer_scanned() to verify 
>>>>>>>> threads_do()
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8200374
>>>>>>>>
>>>>>>>> Here's the webrev URL:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8200374-webrev/0_for_jdk_hs_open/ 
>>>>>>>>
>>>>>>>>
>>>>>>>> This is Erik O's improvement on the assertion added by the 
>>>>>>>> following
>>>>>>>> bug fix (with some minor tweaking done by me):
>>>>>>>>
>>>>>>>> JDK-8199813 SIGSEGV in ThreadsList::includes()
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8199813
>>>>>>>>
>>>>>>>> Summary of the changes:
>>>>>>>>
>>>>>>>> - Replace the assertion that I added in JDK-8199813 with a closure
>>>>>>>>   based function that verifies the threads_do() contract depended
>>>>>>>>   on by Thread-SMR.
>>>>>>>> - Add ThreadsSMRSupport::verify_hazard_pointer_scanned() to verify
>>>>>>>>   that the calling thread's hazard pointer is scanned by 
>>>>>>>> threads_do().
>>>>>>>>   The new function is called from both 
>>>>>>>> ThreadsSMRSupport::acquire_*()
>>>>>>>>   functions.
>>>>>>>> - Refactor the non-JavaThread part of Threads::threads_do() into
>>>>>>>>   Threads::non_java_threads_do() so that the non-JavaThread part
>>>>>>>>   can also be called by other threads_do() functions. Yes, the
>>>>>>>>   Threads::threads_do() contract is still to scan every thread in
>>>>>>>>   the system.
>>>>>>>> - Add hooks for a "tracing sampler thread" to be optionally 
>>>>>>>> scanned
>>>>>>>>   by Threads::non_java_threads_do().
>>>>>>>>
>>>>>>>> This fix has gone thru a couple of Mach5 builds-tier1, 
>>>>>>>> jdk-tier[1-3],
>>>>>>>> and hs-tier[1-3] runs. I've also started my usual 24+ hour 
>>>>>>>> Thread-SMR
>>>>>>>> stress testing run on my Solaris-X64 server.
>>>>>>>>
>>>>>>>> Just to be extra sure, I backed out the fix from JDK-8199813 to
>>>>>>>> JavaThread::verify_not_published() (which started this round of
>>>>>>>> Thread-SMR fixes) and we catch the issue in the new function
>>>>>>>> ThreadsSMRSupport::verify_hazard_pointer_scanned().
>>>>>>>>
>>>>>>>> Thanks, in advance, for any comments, suggestions, or feedback.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>



More information about the hotspot-runtime-dev mailing list