RFR: 8212933: Thread-SMR: requesting a VM operation whilst holding a ThreadsListHandle can cause deadlocks

Daniel D. Daugherty daniel.daugherty at oracle.com
Mon Oct 29 13:52:32 UTC 2018


On 10/28/18 4:08 PM, Robbin Ehn wrote:
> Hi Dan,
>
> Thanks for looking at this, here is the update:
> Inc: http://cr.openjdk.java.net/~rehn/8212933/v2/inc/webrev/

src/hotspot/share/runtime/handshake.cpp
     No comments.

test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
     No comments.

Thumbs up!

Thanks for making the updates.

Dan



> Full: http://cr.openjdk.java.net/~rehn/8212933/v2/webrev/
>
> /Robbin
>
> On 26/10/2018 17:38, Daniel D. Daugherty wrote:
>> On 10/26/18 10:33 AM, Robbin Ehn wrote:
>>> Hi, please review.
>>>
>>> When the VM thread executes a handshake it uses different 
>>> ThreadsLists during
>>> the execution. A JavaThread that is armed for the handshake when it 
>>> is already
>>> in the exit path in VM will cancel the handshake. Even if the VM 
>>> thread cannot
>>> see this thread after the initial ThreadsList which where used for 
>>> arming, the
>>> handshake can progress when the exiting thread cancels the handshake.
>>>
>>> But if a third thread takes a ThreadsList where the exiting 
>>> JavaThread is present and tries to execute a VM operation, hence 
>>> waiting on VM thread to finish the handshake, the JavaThread in the 
>>> exit path can never reach the handshake cancellation point. VM 
>>> thread cannot finishes the handshake and the third thread is stuck 
>>> waiting on the VM thread.
>>>
>>> To allow holding a ThreadsList when executing a VM operation we 
>>> instead let the
>>> VM thread use the same ThreadsList over the entire handshake making 
>>> all armed
>>> threads visible to the VM thread at all time. And if VM thread spots 
>>> a terminated thread it will count that thread is already done by 
>>> only clearing
>>> it's operation.
>>>
>>> Passes local stress testing, t1-5 and the deadlock is no longer 
>>> reproduce-able.
>>> Added a jtreg handshake + thread suspend test as a reproducer.
>>>
>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8212933
>>> Code: http://cr.openjdk.java.net/~rehn/8212933/v1/webrev/
>>
>> src/hotspot/share/runtime/handshake.hpp
>>      No comments.
>>
>> src/hotspot/share/runtime/handshake.cpp
>>      L358: void HandshakeState::process_by_vmthread(JavaThread* 
>> target) {
>>      L359:   assert(Thread::current()->is_VM_thread(), "should call 
>> from vm thread");
>>          Both calls to handshake_process_by_vmthread() which calls this
>>          function are made with the Threads_lock held:
>>
>>          MutexLockerEx ml(Threads_lock, 
>> Mutex::_no_safepoint_check_flag);
>>
>>          Looks like the lock is grabbed because of
>>          possibly_vmthread_can_process_handshake() which asserts:
>>
>>          L351:   // An externally suspended thread cannot be resumed 
>> while the
>>          L352:   // Threads_lock is held so it is safe.
>>          L353:   // Note that this method is allowed to produce false 
>> positives.
>>          L354:   assert(Threads_lock->owned_by_self(), "Not holding 
>> Threads_lock.");
>>          L355:   if (target->is_ext_suspended()) {
>>          L356:     return true;
>>          L357:   }
>>
>>          Also looks like vmthread_can_process_handshake() needs the
>>          Threads_lock for the same externally suspended thread check.
>>
>>          So I was going to ask that you add:
>>
>>          assert(Threads_lock->owned_by_self(), "Not holding 
>> Threads_lock.");
>>
>>          after L359, but how about a comment instead:
>>
>>          // Threads_lock must be held here, but that is assert()ed in
>>          // possibly_vmthread_can_process_handshake().
>>
>>
>> src/hotspot/share/runtime/thread.hpp
>>      No comments.
>>
>> src/hotspot/share/runtime/thread.cpp
>>      No comments.
>>
>> src/hotspot/share/runtime/threadSMR.cpp
>>      No comments.
>>
>> test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
>>      Very nice test! It specifically exercises ThreadLocalHandshakes
>>      with JavaThread suspend/resume. runtime/Thread/SuspendAtExit.java
>>      only ran into this bug by accident (JDK-8212152) so I like the
>>      targeted test.
>>
>>      L49:         while(!exit_now) {
>>          nit - please add a space before '('
>>
>>      L51:             for (int i = 0; i < _threads.length; i+=2) {
>>      L58:             for (int i = 0; i < _threads.length; i+=2) {
>>          nit - please added spaces around '+='
>>
>>          So why every other thread? A comment would be good...
>>
>>      L52:                 wb.handshakeWalkStack(null, true);
>>          I'm guessing the 'null' parameter means current thread, but
>>          that's a guess on my part. A comment would be good.
>>
>>      L82:         for (int i = 0; i < _threads.length; i++) {
>>      L83:             _threads[i].join();
>>      L84:         }
>>          Thanks for cleaning up the test_threads. That will make
>>          the JTREG thread sweeper happy. However, you don't save
>>          the test_exit_thread references and you don't clean those
>>          up either. Yes, I realize that they are supposed to exit,
>>          but if something hangs up on exit, I'd rather have a join()
>>          hang failure in this test's code than have the JTREG thread
>>          sweeper catch it.
>>
>> Dan
>>
>>>
>>> Thanks, Robbin
>>



More information about the serviceability-dev mailing list