RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Thu Apr 1 07:20:27 UTC 2021

On Wed, 31 Mar 2021 16:24:11 GMT, Daniel D. Daugherty <dcubed at openjdk.org> wrote:

>> @robehn - Thanks for reviewing the fix. Yes, I think you have missed something. :-)
>> 
>> I modeled the analysis of this race after one of your favorite race techniques
>> in my analysis.ThreadsList.for_JBS attachment to the bug report: since there
>> is nothing to force the two threads to interact with each in a particular order,
>> I posited delays at various points in the execution of each thread. This quote
>> from the analysis.ThreadsList.for_JBS attachment describes scenario:
>> 
>> Race note:
>>     THR-1 is the thread calling the TLH destructors.
>>     THR-2 is the exiting thread calling ThreadsSMRSupport::free_list.
>> 
>>     If THR-2's ThreadsSMRSupport::free_list() call finishes its scan of of
>>     the active Threads _threads_hazard_ptr values BEFORE the TLH-2
>>     destructor code sequence updates THR-1->_threads_hazard_ptr from TL-2
>>     to TL-1, then TL-2 and not TL-1 will be on the list of in-use
>>     ThreadsLists:
>> 
>>       // Gather a hash table of the current hazard ptrs:
>>       ThreadScanHashtable *scan_table = new ThreadScanHashtable(hash_table_size);
>>       ScanHazardPtrGatherThreadsListClosure scan_cl(scan_table);
>>       threads_do(&scan_cl);
>> 
>>     At this point, THR-2's ThreadsSMRSupport::free_list() call stalls and
>>     THR-1 not only finishes the TLH-2 destructor, it also finishes its use
>>     of TLH-1 as described in the next section and starts to run the TLH-1
>>     destructor.
>> 
>> After the first ThreadsListHandle is released for THR-1:
>> 
>> +----------------------------------+
>> | THR-1                            |
>> +----------------------------------+
>> | _threads_hazard_ptr=0            |
>> | _threads_list_ptr=0              |
>> | _nested_threads_hazard_ptr_cnt=0 |
>> +----------------------------------+
>> 
>>                                          +----------------------+
>>                                          | TL-1                 |
>>                                          +----------------------+
>>                                          | _length=XXXXXXXXXXXX |
>>                                          | _next_list=XXXXXXXXX |
>>                                          | _threads[5]=XXXXXXXX |
>>                                          | _nested_handle_cnt=X |
>>                                          +----------------------+
>> 
>> Race note:
>>     THR-1 is running the TLH-1 destructor and has decremented the TL-1
>>     _nested_handle_cnt, but stalls before it clears _threads_hazard_ptr.
>> 
>>     The THR-2's ThreadsSMRSupport::free_list() call continues executing and
>>     checks the _to_delete_list ThreadsLists and if they are not in the
>>     scan_table and have a _nested_handle_cnt == 0 then, they are freed.
>> 
>>     This is how TL-1 is freed, but still remains in THR-1's
>>     _threads_hazard_ptr field and can be observed later by THR-2 as a valid
>>     hazard ptr in its call to smr_delete() on itself or by another thread
>>     perusing the system ThreadsList. This is especially true after
>>     ThreadsSMRSupport::free_list() has finished its work and released the
>>     Threads_lock which will allow another thread to walk the set of hazard
>>     ptrs.
>> 
>>     THR-1 resumes running again and clears _threads_hazard_ptr. However,
>>     the other thread walking the set of hazard ptrs has the stale TL-1
>>     value and tries to use it. Boom!
>> 
>> Switching the decrement:
>> `_list->dec_nested_handle_cnt()`
>> to happen after the:
>> `_thread->set_threads_hazard_ptr(_previous->_list)`
>> doesn't help because THR-2 observed TL-2 before we
>> reached that code and then THR-2 stalled until after all
>> the updates were made. THR-2 recorded TL-2 in the
>> collection of current hazard ptrs and THR-2 knows nothing
>> about TL-1 being a valid hazard ptr so THR-2 can free it.
>
> I tested JDK-8264123 together with this fix (JDK-8264393) in Mach5 Tier[1-7]
> and there are no regressions.

Hi Dan, yes thanks.

So I would say, you may not install a ThreadsList into your hazard pointer if it's on the _to_delete_list.
Got it, thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272