RFR: 8212933: Thread-SMR: requesting a VM operation whilst holding a ThreadsListHandle can cause deadlocks

Robbin Ehn robbin.ehn at oracle.com
Mon Oct 29 09:05:07 UTC 2018


Hi David,

On 29/10/2018 07:20, David Holmes wrote:
> Hi Robbin,
> 
> On 29/10/2018 6:08 AM, Robbin Ehn wrote:
>> Hi Dan,
>>
>> Thanks for looking at this, here is the update:
>> Inc: http://cr.openjdk.java.net/~rehn/8212933/v2/inc/webrev/
>> Full: http://cr.openjdk.java.net/~rehn/8212933/v2/webrev/
> 
> I can't say I really understand the change in protocol here and why all the 
> cancel operations are no longer needed. I see the handshake VM operations 
> reusing the initial "threads list" but I'm unclear how they might be affected if 
> additional threads are added to the system before the Threads_lock is acquired?

The ThreadsList is a snapshot of all the JavaThreads at that time in the VM.
Handshake all threads only handshake those JavaThreads. We do not care about new
threads.

The typical generic use-case is the similar to RCU. You first update a global
state and emit the handshake when the handshake return no thread can see the old
state.

GlobalFuncPtr = some_new_func;
HandshakeAllThreads;
------------------------------
No thread can be executing the old func.

If the JavaThreads have a local copy of GlobalFuncPtr the handshake operation 
would be to update the local copy to some_new_func.

It works for both Java and for VM resources that respect safepoints.
For a pure VM resource it's much cheaper to use the GlobalCounter.

The Threads_lock must only be held for S/R protocol.
With changes to the S/R protocol, such as using handshake instead, we can remove
Threads_lock for handshakes completely. (with a other small fixes)

The cancel is no longer needed since the terminated threads are visible to the
VM thread when we keep the arming threadslist. We add terminated threads as safe
for handshake. But if we handshake a terminated thread we do not execute the
handshake operation, instead just clear the operation and increment the
completed counter. (the VM thread cancels the operation)

I hope that helped?

> 
> A couple of specific comments:
> 
> src/hotspot/share/runtime/handshake.hpp
> 
> cancel_inner() is dead now.
> 
> ---
> 
> src/hotspot/share/runtime/handshake.cpp
> 
> This was an odd looking for loop before your change and now looks even more 
> strange:
> 
>   for ( ; JavaThread *thr = jtiwh.next(); ) {
> 
> can it not simply be a more normal looking:
> 
>   for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = jtiwh.next()) {
> 
> ?

Thanks, fixed with below patch.

/Robbin

diff -r 5f8b292c473f src/hotspot/share/runtime/handshake.cpp
--- a/src/hotspot/share/runtime/handshake.cpp	Sun Oct 28 20:57:24 2018 +0100
+++ b/src/hotspot/share/runtime/handshake.cpp	Mon Oct 29 09:32:26 2018 +0100
@@ -166,1 +166,1 @@
-    for ( ; JavaThread *thr = jtiwh.next(); ) {
+    for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = jtiwh.next()) {
@@ -198,1 +198,1 @@
-          for ( ; JavaThread *thr = jtiwh.next(); ) {
+          for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = jtiwh.next()) {
diff -r 5f8b292c473f src/hotspot/share/runtime/handshake.hpp
--- a/src/hotspot/share/runtime/handshake.hpp	Sun Oct 28 20:57:24 2018 +0100
+++ b/src/hotspot/share/runtime/handshake.hpp	Mon Oct 29 09:32:26 2018 +0100
@@ -63,1 +62,0 @@
-  void cancel_inner(JavaThread* thread);


> 
> ---
> 
> Thanks,
> David
> 
>> /Robbin
>>
>> On 26/10/2018 17:38, Daniel D. Daugherty wrote:
>>> On 10/26/18 10:33 AM, Robbin Ehn wrote:
>>>> Hi, please review.
>>>>
>>>> When the VM thread executes a handshake it uses different ThreadsLists during
>>>> the execution. A JavaThread that is armed for the handshake when it is already
>>>> in the exit path in VM will cancel the handshake. Even if the VM thread cannot
>>>> see this thread after the initial ThreadsList which where used for arming, the
>>>> handshake can progress when the exiting thread cancels the handshake.
>>>>
>>>> But if a third thread takes a ThreadsList where the exiting JavaThread is 
>>>> present and tries to execute a VM operation, hence waiting on VM thread to 
>>>> finish the handshake, the JavaThread in the exit path can never reach the 
>>>> handshake cancellation point. VM thread cannot finishes the handshake and 
>>>> the third thread is stuck waiting on the VM thread.
>>>>
>>>> To allow holding a ThreadsList when executing a VM operation we instead let the
>>>> VM thread use the same ThreadsList over the entire handshake making all armed
>>>> threads visible to the VM thread at all time. And if VM thread spots a 
>>>> terminated thread it will count that thread is already done by only clearing
>>>> it's operation.
>>>>
>>>> Passes local stress testing, t1-5 and the deadlock is no longer reproduce-able.
>>>> Added a jtreg handshake + thread suspend test as a reproducer.
>>>>
>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8212933
>>>> Code: http://cr.openjdk.java.net/~rehn/8212933/v1/webrev/
>>>
>>> src/hotspot/share/runtime/handshake.hpp
>>>      No comments.
>>>
>>> src/hotspot/share/runtime/handshake.cpp
>>>      L358: void HandshakeState::process_by_vmthread(JavaThread* target) {
>>>      L359:   assert(Thread::current()->is_VM_thread(), "should call from vm 
>>> thread");
>>>          Both calls to handshake_process_by_vmthread() which calls this
>>>          function are made with the Threads_lock held:
>>>
>>>          MutexLockerEx ml(Threads_lock, Mutex::_no_safepoint_check_flag);
>>>
>>>          Looks like the lock is grabbed because of
>>>          possibly_vmthread_can_process_handshake() which asserts:
>>>
>>>          L351:   // An externally suspended thread cannot be resumed while the
>>>          L352:   // Threads_lock is held so it is safe.
>>>          L353:   // Note that this method is allowed to produce false positives.
>>>          L354:   assert(Threads_lock->owned_by_self(), "Not holding 
>>> Threads_lock.");
>>>          L355:   if (target->is_ext_suspended()) {
>>>          L356:     return true;
>>>          L357:   }
>>>
>>>          Also looks like vmthread_can_process_handshake() needs the
>>>          Threads_lock for the same externally suspended thread check.
>>>
>>>          So I was going to ask that you add:
>>>
>>>          assert(Threads_lock->owned_by_self(), "Not holding Threads_lock.");
>>>
>>>          after L359, but how about a comment instead:
>>>
>>>          // Threads_lock must be held here, but that is assert()ed in
>>>          // possibly_vmthread_can_process_handshake().
>>>
>>>
>>> src/hotspot/share/runtime/thread.hpp
>>>      No comments.
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>>      No comments.
>>>
>>> src/hotspot/share/runtime/threadSMR.cpp
>>>      No comments.
>>>
>>> test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>      Very nice test! It specifically exercises ThreadLocalHandshakes
>>>      with JavaThread suspend/resume. runtime/Thread/SuspendAtExit.java
>>>      only ran into this bug by accident (JDK-8212152) so I like the
>>>      targeted test.
>>>
>>>      L49:         while(!exit_now) {
>>>          nit - please add a space before '('
>>>
>>>      L51:             for (int i = 0; i < _threads.length; i+=2) {
>>>      L58:             for (int i = 0; i < _threads.length; i+=2) {
>>>          nit - please added spaces around '+='
>>>
>>>          So why every other thread? A comment would be good...
>>>
>>>      L52:                 wb.handshakeWalkStack(null, true);
>>>          I'm guessing the 'null' parameter means current thread, but
>>>          that's a guess on my part. A comment would be good.
>>>
>>>      L82:         for (int i = 0; i < _threads.length; i++) {
>>>      L83:             _threads[i].join();
>>>      L84:         }
>>>          Thanks for cleaning up the test_threads. That will make
>>>          the JTREG thread sweeper happy. However, you don't save
>>>          the test_exit_thread references and you don't clean those
>>>          up either. Yes, I realize that they are supposed to exit,
>>>          but if something hangs up on exit, I'd rather have a join()
>>>          hang failure in this test's code than have the JTREG thread
>>>          sweeper catch it.
>>>
>>> Dan
>>>
>>>>
>>>> Thanks, Robbin
>>>


More information about the hotspot-runtime-dev mailing list