RFR: 8212933: Thread-SMR: requesting a VM operation whilst holding a ThreadsListHandle can cause deadlocks
serguei.spitsyn at oracle.com
serguei.spitsyn at oracle.com
Mon Oct 29 17:35:04 UTC 2018
Hi Robbin,
+1
Thanks,
Serguei
On 10/29/18 06:52, Daniel D. Daugherty wrote:
> On 10/28/18 4:08 PM, Robbin Ehn wrote:
>> Hi Dan,
>>
>> Thanks for looking at this, here is the update:
>> Inc: http://cr.openjdk.java.net/~rehn/8212933/v2/inc/webrev/
>
> src/hotspot/share/runtime/handshake.cpp
> No comments.
>
> test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
> No comments.
>
> Thumbs up!
>
> Thanks for making the updates.
>
> Dan
>
>
>
>> Full: http://cr.openjdk.java.net/~rehn/8212933/v2/webrev/
>>
>> /Robbin
>>
>> On 26/10/2018 17:38, Daniel D. Daugherty wrote:
>>> On 10/26/18 10:33 AM, Robbin Ehn wrote:
>>>> Hi, please review.
>>>>
>>>> When the VM thread executes a handshake it uses different
>>>> ThreadsLists during
>>>> the execution. A JavaThread that is armed for the handshake when it
>>>> is already
>>>> in the exit path in VM will cancel the handshake. Even if the VM
>>>> thread cannot
>>>> see this thread after the initial ThreadsList which where used for
>>>> arming, the
>>>> handshake can progress when the exiting thread cancels the handshake.
>>>>
>>>> But if a third thread takes a ThreadsList where the exiting
>>>> JavaThread is present and tries to execute a VM operation, hence
>>>> waiting on VM thread to finish the handshake, the JavaThread in the
>>>> exit path can never reach the handshake cancellation point. VM
>>>> thread cannot finishes the handshake and the third thread is stuck
>>>> waiting on the VM thread.
>>>>
>>>> To allow holding a ThreadsList when executing a VM operation we
>>>> instead let the
>>>> VM thread use the same ThreadsList over the entire handshake making
>>>> all armed
>>>> threads visible to the VM thread at all time. And if VM thread
>>>> spots a terminated thread it will count that thread is already done
>>>> by only clearing
>>>> it's operation.
>>>>
>>>> Passes local stress testing, t1-5 and the deadlock is no longer
>>>> reproduce-able.
>>>> Added a jtreg handshake + thread suspend test as a reproducer.
>>>>
>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8212933
>>>> Code: http://cr.openjdk.java.net/~rehn/8212933/v1/webrev/
>>>
>>> src/hotspot/share/runtime/handshake.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/handshake.cpp
>>> L358: void HandshakeState::process_by_vmthread(JavaThread*
>>> target) {
>>> L359: assert(Thread::current()->is_VM_thread(), "should call
>>> from vm thread");
>>> Both calls to handshake_process_by_vmthread() which calls this
>>> function are made with the Threads_lock held:
>>>
>>> MutexLockerEx ml(Threads_lock,
>>> Mutex::_no_safepoint_check_flag);
>>>
>>> Looks like the lock is grabbed because of
>>> possibly_vmthread_can_process_handshake() which asserts:
>>>
>>> L351: // An externally suspended thread cannot be resumed
>>> while the
>>> L352: // Threads_lock is held so it is safe.
>>> L353: // Note that this method is allowed to produce
>>> false positives.
>>> L354: assert(Threads_lock->owned_by_self(), "Not holding
>>> Threads_lock.");
>>> L355: if (target->is_ext_suspended()) {
>>> L356: return true;
>>> L357: }
>>>
>>> Also looks like vmthread_can_process_handshake() needs the
>>> Threads_lock for the same externally suspended thread check.
>>>
>>> So I was going to ask that you add:
>>>
>>> assert(Threads_lock->owned_by_self(), "Not holding
>>> Threads_lock.");
>>>
>>> after L359, but how about a comment instead:
>>>
>>> // Threads_lock must be held here, but that is assert()ed in
>>> // possibly_vmthread_can_process_handshake().
>>>
>>>
>>> src/hotspot/share/runtime/thread.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/threadSMR.cpp
>>> No comments.
>>>
>>> test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
>>> Very nice test! It specifically exercises ThreadLocalHandshakes
>>> with JavaThread suspend/resume. runtime/Thread/SuspendAtExit.java
>>> only ran into this bug by accident (JDK-8212152) so I like the
>>> targeted test.
>>>
>>> L49: while(!exit_now) {
>>> nit - please add a space before '('
>>>
>>> L51: for (int i = 0; i < _threads.length; i+=2) {
>>> L58: for (int i = 0; i < _threads.length; i+=2) {
>>> nit - please added spaces around '+='
>>>
>>> So why every other thread? A comment would be good...
>>>
>>> L52: wb.handshakeWalkStack(null, true);
>>> I'm guessing the 'null' parameter means current thread, but
>>> that's a guess on my part. A comment would be good.
>>>
>>> L82: for (int i = 0; i < _threads.length; i++) {
>>> L83: _threads[i].join();
>>> L84: }
>>> Thanks for cleaning up the test_threads. That will make
>>> the JTREG thread sweeper happy. However, you don't save
>>> the test_exit_thread references and you don't clean those
>>> up either. Yes, I realize that they are supposed to exit,
>>> but if something hangs up on exit, I'd rather have a join()
>>> hang failure in this test's code than have the JTREG thread
>>> sweeper catch it.
>>>
>>> Dan
>>>
>>>>
>>>> Thanks, Robbin
>>>
>
More information about the serviceability-dev
mailing list