RFR: 8212933: Thread-SMR: requesting a VM operation whilst holding a ThreadsListHandle can cause deadlocks

David Holmes david.holmes at oracle.com
Tue Oct 30 01:31:51 UTC 2018


Thanks for the explanation Robbin.

The inline patch also seems fine. I hope the other reviewers noticed it.

David

On 29/10/2018 7:05 PM, Robbin Ehn wrote:
> Hi David,
> 
> On 29/10/2018 07:20, David Holmes wrote:
>> Hi Robbin,
>>
>> On 29/10/2018 6:08 AM, Robbin Ehn wrote:
>>> Hi Dan,
>>>
>>> Thanks for looking at this, here is the update:
>>> Inc: http://cr.openjdk.java.net/~rehn/8212933/v2/inc/webrev/
>>> Full: http://cr.openjdk.java.net/~rehn/8212933/v2/webrev/
>>
>> I can't say I really understand the change in protocol here and why 
>> all the cancel operations are no longer needed. I see the handshake VM 
>> operations reusing the initial "threads list" but I'm unclear how they 
>> might be affected if additional threads are added to the system before 
>> the Threads_lock is acquired?
> 
> The ThreadsList is a snapshot of all the JavaThreads at that time in the 
> VM.
> Handshake all threads only handshake those JavaThreads. We do not care 
> about new
> threads.
> 
> The typical generic use-case is the similar to RCU. You first update a 
> global
> state and emit the handshake when the handshake return no thread can see 
> the old
> state.
> 
> GlobalFuncPtr = some_new_func;
> HandshakeAllThreads;
> ------------------------------
> No thread can be executing the old func.
> 
> If the JavaThreads have a local copy of GlobalFuncPtr the handshake 
> operation would be to update the local copy to some_new_func.
> 
> It works for both Java and for VM resources that respect safepoints.
> For a pure VM resource it's much cheaper to use the GlobalCounter.
> 
> The Threads_lock must only be held for S/R protocol.
> With changes to the S/R protocol, such as using handshake instead, we 
> can remove
> Threads_lock for handshakes completely. (with a other small fixes)
> 
> The cancel is no longer needed since the terminated threads are visible 
> to the
> VM thread when we keep the arming threadslist. We add terminated threads 
> as safe
> for handshake. But if we handshake a terminated thread we do not execute 
> the
> handshake operation, instead just clear the operation and increment the
> completed counter. (the VM thread cancels the operation)
> 
> I hope that helped?
> 
>>
>> A couple of specific comments:
>>
>> src/hotspot/share/runtime/handshake.hpp
>>
>> cancel_inner() is dead now.
>>
>> ---
>>
>> src/hotspot/share/runtime/handshake.cpp
>>
>> This was an odd looking for loop before your change and now looks even 
>> more strange:
>>
>>   for ( ; JavaThread *thr = jtiwh.next(); ) {
>>
>> can it not simply be a more normal looking:
>>
>>   for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = jtiwh.next()) {
>>
>> ?
> 
> Thanks, fixed with below patch.
> 
> /Robbin
> 
> diff -r 5f8b292c473f src/hotspot/share/runtime/handshake.cpp
> --- a/src/hotspot/share/runtime/handshake.cpp    Sun Oct 28 20:57:24 
> 2018 +0100
> +++ b/src/hotspot/share/runtime/handshake.cpp    Mon Oct 29 09:32:26 
> 2018 +0100
> @@ -166,1 +166,1 @@
> -    for ( ; JavaThread *thr = jtiwh.next(); ) {
> +    for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = 
> jtiwh.next()) {
> @@ -198,1 +198,1 @@
> -          for ( ; JavaThread *thr = jtiwh.next(); ) {
> +          for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = 
> jtiwh.next()) {
> diff -r 5f8b292c473f src/hotspot/share/runtime/handshake.hpp
> --- a/src/hotspot/share/runtime/handshake.hpp    Sun Oct 28 20:57:24 
> 2018 +0100
> +++ b/src/hotspot/share/runtime/handshake.hpp    Mon Oct 29 09:32:26 
> 2018 +0100
> @@ -63,1 +62,0 @@
> -  void cancel_inner(JavaThread* thread);
> 
> 
>>
>> ---
>>
>> Thanks,
>> David
>>
>>> /Robbin
>>>
>>> On 26/10/2018 17:38, Daniel D. Daugherty wrote:
>>>> On 10/26/18 10:33 AM, Robbin Ehn wrote:
>>>>> Hi, please review.
>>>>>
>>>>> When the VM thread executes a handshake it uses different 
>>>>> ThreadsLists during
>>>>> the execution. A JavaThread that is armed for the handshake when it 
>>>>> is already
>>>>> in the exit path in VM will cancel the handshake. Even if the VM 
>>>>> thread cannot
>>>>> see this thread after the initial ThreadsList which where used for 
>>>>> arming, the
>>>>> handshake can progress when the exiting thread cancels the handshake.
>>>>>
>>>>> But if a third thread takes a ThreadsList where the exiting 
>>>>> JavaThread is present and tries to execute a VM operation, hence 
>>>>> waiting on VM thread to finish the handshake, the JavaThread in the 
>>>>> exit path can never reach the handshake cancellation point. VM 
>>>>> thread cannot finishes the handshake and the third thread is stuck 
>>>>> waiting on the VM thread.
>>>>>
>>>>> To allow holding a ThreadsList when executing a VM operation we 
>>>>> instead let the
>>>>> VM thread use the same ThreadsList over the entire handshake making 
>>>>> all armed
>>>>> threads visible to the VM thread at all time. And if VM thread 
>>>>> spots a terminated thread it will count that thread is already done 
>>>>> by only clearing
>>>>> it's operation.
>>>>>
>>>>> Passes local stress testing, t1-5 and the deadlock is no longer 
>>>>> reproduce-able.
>>>>> Added a jtreg handshake + thread suspend test as a reproducer.
>>>>>
>>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8212933
>>>>> Code: http://cr.openjdk.java.net/~rehn/8212933/v1/webrev/
>>>>
>>>> src/hotspot/share/runtime/handshake.hpp
>>>>      No comments.
>>>>
>>>> src/hotspot/share/runtime/handshake.cpp
>>>>      L358: void HandshakeState::process_by_vmthread(JavaThread* 
>>>> target) {
>>>>      L359:   assert(Thread::current()->is_VM_thread(), "should call 
>>>> from vm thread");
>>>>          Both calls to handshake_process_by_vmthread() which calls this
>>>>          function are made with the Threads_lock held:
>>>>
>>>>          MutexLockerEx ml(Threads_lock, 
>>>> Mutex::_no_safepoint_check_flag);
>>>>
>>>>          Looks like the lock is grabbed because of
>>>>          possibly_vmthread_can_process_handshake() which asserts:
>>>>
>>>>          L351:   // An externally suspended thread cannot be resumed 
>>>> while the
>>>>          L352:   // Threads_lock is held so it is safe.
>>>>          L353:   // Note that this method is allowed to produce 
>>>> false positives.
>>>>          L354:   assert(Threads_lock->owned_by_self(), "Not holding 
>>>> Threads_lock.");
>>>>          L355:   if (target->is_ext_suspended()) {
>>>>          L356:     return true;
>>>>          L357:   }
>>>>
>>>>          Also looks like vmthread_can_process_handshake() needs the
>>>>          Threads_lock for the same externally suspended thread check.
>>>>
>>>>          So I was going to ask that you add:
>>>>
>>>>          assert(Threads_lock->owned_by_self(), "Not holding 
>>>> Threads_lock.");
>>>>
>>>>          after L359, but how about a comment instead:
>>>>
>>>>          // Threads_lock must be held here, but that is assert()ed in
>>>>          // possibly_vmthread_can_process_handshake().
>>>>
>>>>
>>>> src/hotspot/share/runtime/thread.hpp
>>>>      No comments.
>>>>
>>>> src/hotspot/share/runtime/thread.cpp
>>>>      No comments.
>>>>
>>>> src/hotspot/share/runtime/threadSMR.cpp
>>>>      No comments.
>>>>
>>>> test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>      Very nice test! It specifically exercises ThreadLocalHandshakes
>>>>      with JavaThread suspend/resume. runtime/Thread/SuspendAtExit.java
>>>>      only ran into this bug by accident (JDK-8212152) so I like the
>>>>      targeted test.
>>>>
>>>>      L49:         while(!exit_now) {
>>>>          nit - please add a space before '('
>>>>
>>>>      L51:             for (int i = 0; i < _threads.length; i+=2) {
>>>>      L58:             for (int i = 0; i < _threads.length; i+=2) {
>>>>          nit - please added spaces around '+='
>>>>
>>>>          So why every other thread? A comment would be good...
>>>>
>>>>      L52:                 wb.handshakeWalkStack(null, true);
>>>>          I'm guessing the 'null' parameter means current thread, but
>>>>          that's a guess on my part. A comment would be good.
>>>>
>>>>      L82:         for (int i = 0; i < _threads.length; i++) {
>>>>      L83:             _threads[i].join();
>>>>      L84:         }
>>>>          Thanks for cleaning up the test_threads. That will make
>>>>          the JTREG thread sweeper happy. However, you don't save
>>>>          the test_exit_thread references and you don't clean those
>>>>          up either. Yes, I realize that they are supposed to exit,
>>>>          but if something hangs up on exit, I'd rather have a join()
>>>>          hang failure in this test's code than have the JTREG thread
>>>>          sweeper catch it.
>>>>
>>>> Dan
>>>>
>>>>>
>>>>> Thanks, Robbin
>>>>


More information about the serviceability-dev mailing list