RFR: 8305819: LogConfigurationTest intermittently fails on AArch64 [v2]

David Holmes dholmes at openjdk.org
Mon May 8 04:53:22 UTC 2023


On Fri, 5 May 2023 13:15:33 GMT, Dan Gao <duke at openjdk.org> wrote:

>> LogConfigurationTest*reconfigure*MT* crash intermittently on AArch64.
>> According to the crash log and coredump, we found it crash as follows:
>> 
>> void LogTagSet::log(LogLevelType level, const char* msg) {
>>   LogOutputList::Iterator it = _output_list.iterator(level);
>>   LogDecorations decorations(level, *this, _decorators);
>> 
>>   for (; it != _output_list.end(); it++) {
>>     (*it)->write(decorations, msg);//crash 
>>   }
>> }
>> 
>> In the test, two threads write into the log while another thread dynamically changes the decorators and tags. During this time, the  _output_list will be modified. Because of the relax memory model of aarch64, while adding LogOutputNode to LogOutputList, adding node to list and setting the value of node may be reordered, therefore the read thread may not read the correct value of the node's content. Consequently, storestore memory barrier is needed to ensure the order of writing. 
>> By the way, applying this patch may affect performance.
>> 
>> How to reproduce on Linux aarch64:
>> test case
>> 
>> /* @test
>>  * @library /test/lib
>>  * @modules java.base/jdk.internal.misc
>>  *          java.xml
>>  * @run main/native GTestWrapper --gtest_filter=LogConfigurationTest*reconfigure*MT*
>>  */
>> 
>> Crash may occasionally occur after running continuously for 5000 times.
>
> Dan Gao has updated the pull request incrementally with one additional commit since the last revision:
> 
>   comment fixme

If the iteration of the output list falls outside of the RCU reader protection then I think we need to treat it like any other lock-free code. That will at least repair the currently detected bug.

After that we should look at replacing the this "ad-hoc" synchronization (lock + RCU + barriers) with a more direct synchronization mechanism (rwLock?) - which I think is what @theRealAph is advocating for to get rid of any UB associated with RCU.

src/hotspot/share/logging/logOutputList.cpp line 132:

> 130:   // Prevent the reordering of adding node to list and setting the value of node.
> 131:   // Such a reordering would lead to reading incorrect values.
> 132:   OrderAccess::release();

I think if we were to treat this code as the usual pattern of allowing lock-free iteration of a linked-list then we would do a release_store on line 137 and a release_store on line 144. We could then factor the release out ahead of the two loops and place it in the current position. However we would also normally use Atomic::store for the actual stores in the loops, and we would use Atomic::load for the loads in the loops. So:

// To allow lock-free iteration of the output list the updates in the loops
// below require release semantics.
OrderAccess::release();

 // Update the _level_start index
  for (int l = LogLevel::Last; l >= level; l--) {
    LogOutputNode* lnode = Atomic::load(&_level_start[l]);
    if ( lnode == nullptr || lnode->_level < level) {
      Atomic::store(&_level_start[l], node);
    }
  }
  // Add the node the list
  for (LogOutputNode* cur = Atomic::load(&_level_start[LogLevel::Last]); 
        cur != nullptr; 
        cur = Atomic::load(&cur->_next)) {
    if (cur != node && Atomic::load(&cur->_next) == node->_next) {
      Atomic::store(&cur->_next, node);
      break;
    }
}

Logically the intent is that `_level_start` array and the node `_next` field are treated as special "atomic" types and so all accesses must be Atomic::xx. But I suspect in this case that would involve quite a lot of changes.

-------------

Changes requested by dholmes (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13421#pullrequestreview-1416099607
PR Review Comment: https://git.openjdk.org/jdk/pull/13421#discussion_r1187010348


More information about the hotspot-runtime-dev mailing list