RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Wed Mar 31 16:21:26 UTC 2021

On Wed, 31 Mar 2021 08:04:02 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

>> @dholmes-ora - your nutshell summary is spot on. Thanks for the review.
>
> Hi Dan,
> 
> As you describe it and how it look to me, isn't the issue just that we decrement before before reinstating the old list?
> So if we do first publish the previous list as current list and then decrement the nested handle count it should be okay?
> E.g.:
> _thread->set_threads_hazard_ptr(_previous->_list);
> _list->dec_nested_handle_cnt();
> 
> So you just need to move the "if (_has_ref_count) {" piece of code after the potential reinstating of the previous list.
> 
> Or am I missing something?
> 
> Thanks for finding it!

@robehn - Thanks for reviewing the fix. Yes, I think you have missed something. :-)

I modeled the analysis of this race after one of your favorite race techniques
in my analysis.ThreadsList.for_JBS attachment to the bug report: since there
is nothing to force the two threads to interact with each in a particular order,
I posited delays at various points in the execution of each thread. This quote
from the analysis.ThreadsList.for_JBS attachment describes scenario:

Race note:
    THR-1 is the thread calling the TLH destructors.
    THR-2 is the exiting thread calling ThreadsSMRSupport::free_list.

    If THR-2's ThreadsSMRSupport::free_list() call finishes its scan of of
    the active Threads _threads_hazard_ptr values BEFORE the TLH-2
    destructor code sequence updates THR-1->_threads_hazard_ptr from TL-2
    to TL-1, then TL-2 and not TL-1 will be on the list of in-use
    ThreadsLists:

      // Gather a hash table of the current hazard ptrs:
      ThreadScanHashtable *scan_table = new ThreadScanHashtable(hash_table_size);
      ScanHazardPtrGatherThreadsListClosure scan_cl(scan_table);
      threads_do(&scan_cl);

    At this point, THR-2's ThreadsSMRSupport::free_list() call stalls and
    THR-1 not only finishes the TLH-2 destructor, it also finishes its use
    of TLH-1 as described in the next section and starts to run the TLH-1
    destructor.

After the first ThreadsListHandle is released for THR-1:

+----------------------------------+
| THR-1                            |
+----------------------------------+
| _threads_hazard_ptr=0            |
| _threads_list_ptr=0              |
| _nested_threads_hazard_ptr_cnt=0 |
+----------------------------------+

                                         +----------------------+
                                         | TL-1                 |
                                         +----------------------+
                                         | _length=XXXXXXXXXXXX |
                                         | _next_list=XXXXXXXXX |
                                         | _threads[5]=XXXXXXXX |
                                         | _nested_handle_cnt=X |
                                         +----------------------+

Race note:
    THR-1 is running the TLH-1 destructor and has decremented the TL-1
    _nested_handle_cnt, but stalls before it clears _threads_hazard_ptr.

    The THR-2's ThreadsSMRSupport::free_list() call continues executing and
    checks the _to_delete_list ThreadsLists and if they are not in the
    scan_table and have a _nested_handle_cnt == 0 then, they are freed.

    This is how TL-1 is freed, but still remains in THR-1's
    _threads_hazard_ptr field and can be observed later by THR-2 as a valid
    hazard ptr in its call to smr_delete() on itself or by another thread
    perusing the system ThreadsList. This is especially true after
    ThreadsSMRSupport::free_list() has finished its work and released the
    Threads_lock which will allow another thread to walk the set of hazard
    ptrs.

    THR-1 resumes running again and clears _threads_hazard_ptr. However,
    the other thread walking the set of hazard ptrs has the stale TL-1
    value and tries to use it. Boom!

Switching the decrement:
`_list->dec_nested_handle_cnt()`
to happen after the:
`_thread->set_threads_hazard_ptr(_previous->_list)`
doesn't help because THR-2 observed TL-2 before we
reached that code and then THR-2 stalled until after all
the updates were made. THR-2 recorded TL-2 in the
collection of current hazard ptrs and THR-2 knows nothing
about TL-1 being a valid hazard ptr so THR-2 can free it.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272