RFR: 8154715: Missing destructor and/or TLS clearing calls for terminating threads

Sat May 7 05:45:24 UTC 2016

Brian, Stefan,

I've been giving this more thought and I think you are both right. While 
this can't be completely handled in java_start for all threads it can 
handle the most common cases. If I don't add the deletion of the 
WatcherThread then the problematic case reduces to that of the VMThread.

I'll do a new version on Monday. I also need to double-check those 
threads that don't use java_start (which I am sorely tempted to rename).

Thanks,
David

On 7/05/2016 9:50 AM, David Holmes wrote:
> On 7/05/2016 5:40 AM, Brian Gardner wrote:
>> I agree with Stefan.  When I initially ran into the problem I came up
>> with the following changeset that solves my problem, by calling
>> clear_thread_current at the end of java_start.
>>
>> http://brian.timestudybuddy.com/webrev/hotspot__clear_thread_current_alt/webrev/
>>
>
> As I said I can see the appeal in doing this, but there is still a race
> for the threads destroyed at VM shutdown as current_or_null() can return
> non-NULL but the delete can happen before we call
> thread->clear_current_thread().
>
> You would also want Thread::current_or_null_safe() I think, due to
> already deleted JavaThreads.
>
> And this all side-steps the real issue in my opinion that with clean
> thread lifecycle management we should always be able to delete the
> terminating thread.
>
> So to me it is a question of selecting which set of bandaids you want to
> apply.
>
> Thanks,
> David
>
>> Brian
>>
>>> On May 6, 2016, at 8:41 AM, Stefan Karlsson
>>> <stefan.karlsson at oracle.com <mailto:stefan.karlsson at oracle.com>> wrote:
>>>
>>> On 06/05/16 16:32, David Holmes wrote:
>>>> On 7/05/2016 12:04 AM, Stefan Karlsson wrote:
>>>>> Hi David,
>>>>>
>>>>> On 06/05/16 15:38, David Holmes wrote:
>>>>>> Hi Stefan,
>>>>>>
>>>>>> Thanks for taking a look at this.
>>>>>>
>>>>>> On 6/05/2016 5:02 PM, Stefan Karlsson wrote:
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I looked through the GC part of this webrev and I think the
>>>>>>> change is
>>>>>>> fine.
>>>>>>>
>>>>>>> However, it seems a bit error prone. If we decide to change the
>>>>>>> code to,
>>>>>>> for example, terminate the  AbstractGangWorker threads, then we
>>>>>>> have to
>>>>>>> remember to insert a ThreadLocalStorage::set_thread(NULL) call.
>>>>>>
>>>>>> That's why I added the ShouldNotReachHere()'s - if those threads
>>>>>> start
>>>>>> terminating then we will see those hit. Perhaps a comment:
>>>>>>
>>>>>> ShouldNotReachHere(); // If thread terminates we have to do TLS
>>>>>> cleanup
>>>>>>
>>>>>> ?
>>>>>
>>>>> Yes, I would appreciate a comment. Though, when we add new threads, we
>>>>> need to remember to add the set_thread(NULL) call.
>>>>
>>>> Well no, what you would do is manage your new threads in such a way
>>>> that their run() method can do "delete this" as the last call. Only
>>>> if you can't do that do you need to think about what termination
>>>> logic is missing that needs to be done in lieu of the destructor.
>>>
>>> Yes, but this forces every implementer of a Thread:run() function to
>>> have to think about these kind of requirements.
>>>
>>>>
>>>>>>
>>>>>>> Could we instead add a call to
>>>>>>> ThreadLocalStorage::set_thread(NULL), or
>>>>>>> maybe even Thread::clear_thread_current(), in java_start?
>>>>>>>
>>>>>>> static void *java_start(Thread *thread) {
>>>>>>> [...]
>>>>>>>  thread->initialize_thread_current();
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>  // call one more level start routine
>>>>>>>  thread->run();
>>>>>>>
>>>>>>>  ////////// Could we call Thread::clear_thread_current(); here?
>>>>>>
>>>>>> Not easily. For JavaThreads we've already done "delete this" inside
>>>>>> the run() method, so we'd have to move that into java_start as well,
>>>>>> but we can only do the delete for JavaThreads not for other threads.
>>>>>> And we'd also have to change the VMThread and WatcherThread
>>>>>> termination logic because of the deletes that happen in the
>>>>>> termination thread - the "this" pointer (thread above) may no longer
>>>>>> be valid when we want to call clear_current_thread() - which is
>>>>>> why we
>>>>>> can only do the ThreadLocalStorage::set_thread(NULL).
>>>>>>
>>>>>> I agree it would be a lot cleaner to have java_start do:
>>>>>>
>>>>>>  thread->common_initialization();
>>>>>>  thread->run();
>>>>>>  thread->common_cleanup();
>>>>>>  delete thread;
>>>>>>
>>>>>> for all threads, but we'd need a lot of other changes to allow for
>>>>>> that. Otherwise we would need to note that kind of thread before
>>>>>> calling run() then switch on the thread type after run() to decide
>>>>>> what kind of cleanup is necessary and possible. I don't think that
>>>>>> would be better than just doing the "right" cleanup at the end of the
>>>>>> run() methods.
>>>>>
>>>>> I understand that this is a bit messy, and I won't insist that we
>>>>> change
>>>>> this in this RFR, but without looking at this in much detail it sounds
>>>>> weird to delete the thread in run(). Couldn't this be solved by
>>>>> introducing a virtual Thread::post_run() function and do:
>>>>>
>>>>> virtual void Thread::post_run() {
>>>>>  clear_thread_current();
>>>>> }
>>>>>
>>>>> virtual void JavaThread::post_run() {
>>>>>  Thread::post_run();
>>>>>  delete this;
>>>>> }
>>>>
>>>> But again this can't work for the VMThread or WatcherThread as they
>>>> are deleted from the termination thread and so thread->post_run() may
>>>> SEGV.** Plus it is only after the fact that you realize not to put
>>>> "delete this" in Thread::post_run().
>>>
>>> OK, I didn't understand what you meant with "termination thread", but
>>> I now see the call to VMThread::destroy().
>>>
>>> With that said, I find it odd that VMThread::destroy() deletes the VM
>>> thread. We already handshake between the VMThread and the "termination
>>> thread", so why isn't that VMThread::post_run() implemented as:
>>>
>>> virtual void VMThread::post_run() {
>>>  // signal other threads that VM process is gone
>>>  {
>>>    // Note: we must have the _no_safepoint_check_flag. Mutex::lock()
>>> allows
>>>    // VM thread to enter any lock at Safepoint as long as its _owner
>>> is NULL.
>>>    // If that happens after _terminate_lock->wait() has unset _owner
>>>    // but before it actually drops the lock and waits, the
>>> notification below
>>>    // may get lost and we will have a hang. To avoid this, we need to
>>> use
>>>    // Mutex::lock_without_safepoint_check().
>>>    MutexLockerEx ml(_terminate_lock, Mutex::_no_safepoint_check_flag);
>>>    _terminated = true;
>>>    _terminate_lock->notify();
>>>  }
>>>
>>>  Thread::post_run();
>>>  delete this;
>>> }
>>>
>>> And then we wouldn't get a SEGV ...
>>>
>>> I couldn't find the destructor for the WatchThread, but it seems easy
>>> to fix that as well.
>>>
>>> I'm probably missing something, but I find it a bit annoying that code
>>> that should belong to the *Thread:ing system leaks into the
>>> implementations of *Thread::run().
>>>
>>> Thanks,
>>> StefanK
>>>>
>>>> ** Arguably the best solution to the "thread termination races with
>>>> VM termination" problem is to not let the threads terminate. The code
>>>> as it exists today can still have JavaThreads destroying themselves
>>>> at the same that the VM is terminating and potentially hit the same
>>>> errors that require us to not allow the VMThread (and now
>>>> WatcherThread) to delete themselves.
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>> Thanks,
>>>>> StefanK
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>> ------
>>>>>>
>>>>>>>
>>>>>>>  log_info(os, thread)("Thread finished (tid: " UINTX_FORMAT ",
>>>>>>> pthread
>>>>>>> id: " UINTX_FORMAT ").",
>>>>>>>    os::current_thread_id(), (uintx) pthread_self());
>>>>>>>
>>>>>>>  return 0;
>>>>>>> }
>>>>>>>
>>>>>>> And get rid of the explicit calls to
>>>>>>> ThreadLocalStorage::set_thread(NULL) you added?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> StefanK
>>>>>>>
>>>>>>> On 04/05/16 01:39, David Holmes wrote:
>>>>>>>> This needs attention from GC and runtime folk please.
>>>>>>>>
>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8154715
>>>>>>>> webrev: http://cr.openjdk.java.net/~dholmes/8154715/webrev/
>>>>>>>>
>>>>>>>> tl;dr: ensure ThreadLocalStorage::set_thread(NULL) is always called
>>>>>>>> before a thread terminates.
>>>>>>>>
>>>>>>>> Background:
>>>>>>>>
>>>>>>>> Most system-related threads do not expect to explicitly terminate,
>>>>>>>> except sometimes as part of VM termination. Such threads don't have
>>>>>>>> their destructors called, but should.
>>>>>>>>
>>>>>>>> This omission came to light due to the ThreadLocalStorage
>>>>>>>> changes in
>>>>>>>> JDK-8132510. As part of that change we deleted the following
>>>>>>>> from the
>>>>>>>> termination path of the VMThread:
>>>>>>>>
>>>>>>>> // Thread destructor usually does this.
>>>>>>>> ThreadLocalStorage::set_thread(NULL);
>>>>>>>>
>>>>>>>> The clearing of TLS seemed irrelevant to the VMThread as it
>>>>>>>> primarily
>>>>>>>> is used to aid in JNI attach/detach. However Brian Gardner
>>>>>>>> reported:
>>>>>>>>
>>>>>>>> http://mail.openjdk.java.net/pipermail/bsd-port-dev/2016-February/002788.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> a problem on FreeBSD caused by this change and the interaction with
>>>>>>>> the POSIX  pthread TLS destructor use introduced by JDK-8033696.
>>>>>>>> Because the VMThread terminated without clearing TLS, when the
>>>>>>>> TLS-destructor was called it got into a loop which ran four
>>>>>>>> times (as
>>>>>>>> happens on Linux) and then prints a warning to the console (which
>>>>>>>> doesn't happen on Linux).
>>>>>>>>
>>>>>>>> This indicates we need to restore the:
>>>>>>>>
>>>>>>>> ThreadLocalStorage::set_thread(NULL);
>>>>>>>>
>>>>>>>> but on further consideration it seems to me that this is not
>>>>>>>> confined
>>>>>>>> to the VMThread, and the most appropriate fix would be to always
>>>>>>>> invoke the Thread destructor as a thread terminates.
>>>>>>>>
>>>>>>>> Solution:
>>>>>>>>
>>>>>>>> Further investigation shows that calling the Thread destructor
>>>>>>>> in the
>>>>>>>> thread as it terminates is not possible:
>>>>>>>>
>>>>>>>> - VMThread
>>>>>>>>
>>>>>>>> This is actually destroyed by the thread that terminates the VM,
>>>>>>>> but
>>>>>>>> that can happen after it terminates and so we still hit the TLS
>>>>>>>> problem. The VMThread may be able to destroy itself today but in
>>>>>>>> the
>>>>>>>> past this was not possible (see existing code comment), and in the
>>>>>>>> future it may also not be possible - the problem is that the Thread
>>>>>>>> destructor can interact with other VM subsystems that are
>>>>>>>> concurrently
>>>>>>>> being torn down by the thread that is terminating the VM. In the
>>>>>>>> past
>>>>>>>> this was the CodeHeap. So rather than introduce something that is
>>>>>>>> fragile we stick with the current scheme but restore the
>>>>>>>> ThreadLocalStorage::set_thread(NULL); - note we can't access
>>>>>>>> "this" at
>>>>>>>> that time because it may already have been de-allocated.
>>>>>>>>
>>>>>>>> - WatcherThread
>>>>>>>>
>>>>>>>> The WatcherThread is never destroyed today but has the same
>>>>>>>> problem as
>>>>>>>> the VMThread. We can call the destructor from the VM termination
>>>>>>>> thread (and have implemented that), but not from the WatcherThread
>>>>>>>> itself. So again we just have to restore the
>>>>>>>> ThreadLocalStorage::set_thread(NULL); to fix the potential TLS
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> - GC Threads
>>>>>>>>
>>>>>>>> There are two cases:
>>>>>>>>
>>>>>>>> a) GC threads that never terminate
>>>>>>>>
>>>>>>>> For these we don't need to do anything: we can't delete the
>>>>>>>> thread as
>>>>>>>> it never terminates and we don't hit the TLS problem because it
>>>>>>>> never
>>>>>>>> terminates. So all we will do here is add some logic to check (in
>>>>>>>> NON_PRODUCT) that we do in fact never terminate.
>>>>>>>>
>>>>>>>> b) GC threads that can terminate
>>>>>>>>
>>>>>>>> Despite the fact the threads can terminate, references to those
>>>>>>>> threads are stored elsewhere (WorkGangs and other places) and
>>>>>>>> are not
>>>>>>>> cleared as part of the termination process. Those references can be
>>>>>>>> touched after the thread has terminated so we can not call the
>>>>>>>> destructor at all. So again all we can do (without some major
>>>>>>>> thread
>>>>>>>> management reworking) is ensure that
>>>>>>>> ThreadLocalStorage::set_thread(NULL); is called before the thread
>>>>>>>> actually terminates
>>>>>>>>
>>>>>>>> Testing: JPRT
>>>>>>>>         RBT - runtime nightly tests
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>