RFR: 8305819: LogConfigurationTest intermittently fails on AArch64

gaogao-mem duke at openjdk.org
Tue Apr 25 10:58:26 UTC 2023


On Mon, 24 Apr 2023 10:05:19 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> In the test, two threads read the list while another thread insert to the list . 
>> The writing thread insert to the list as follows:
>> 
>> void LogOutputList::add_output(LogOutput* output, LogLevelType level) {
>>   LogOutputNode* node = new LogOutputNode();
>>   node->_value = output;
>>   node->_level = level;
>> 
>>   // Set the next pointer to the first node of a lower level
>>   for (node->_next = _level_start[level];
>>        node->_next != nullptr && node->_next->_level == level;
>>        node->_next = node->_next->_next) {
>>   }
>> 
>>   // Update the _level_start index
>>   for (int l = LogLevel::Last; l >= level; l--) {
>>     if (_level_start[l] == nullptr || _level_start[l]->_level < level) {
>>       _level_start[l] = node;
>>     }
>>   }
>> 
>>   // Add the node the list
>>   for (LogOutputNode* cur = _level_start[LogLevel::Last]; cur != nullptr; cur = cur->_next) {
>>     if (cur != node && cur->_next == node->_next) {
>>       cur->_next = node;
>>       break;
>>     }
>>   }
>> }
>> 
>> On the writing side, it set the value, level and next pointer of the node, then update the _level_start array(_level_start[i] stores the head node of the list whose level is i) and add the node to the list.
>> 
>> And on the reading side, it reads as follows:
>> 
>> void LogTagSet::log(LogLevelType level, const char* msg) {
>>   // Increasing the atomic reader counter in iterator(level) must
>>   // happen before the creation of LogDecorations instance so
>>   // wait_until_no_readers() in LogConfiguration::configure_output()
>>   // synchronizes _decorations as well. The order is guaranteed by
>>   // the implied memory order of Atomic::add().
>>   LogOutputList::Iterator it = _output_list.iterator(level); // obtain the head node of the list from _level_start[level]
>>   LogDecorations decorations(level, *this, _decorators);
>> 
>>   for (; it != _output_list.end(); it++) {
>>     (*it)->write(decorations, msg); //crash
>>   }
>> }
>> 
>> it++ acquires the next node of the list, and then (*it) reads the value of the node.
>> 
>> In the test, it crash at (*it)->write(decorations, msg), it may due to fault loading on the read side. On the writing side, setting the value of node and adding node to list may be reordered, therefore memory barrier is necessary. While on the reading side, I think getting the pointer of node and reading the value is dependent so memory barrier is unnecessary.
>> By the way, I have tested millions of times and it hasn't failed when just adding memory barrier on the writing side.
>
> I see. I think you're wrong, and I'm going to explain why.
> 
> I don't think we've met before, and I don't know your level of knowledge and experience, so I'm going to go through the problem one step at a time. Please forgive me if I appear to be stating the obvious.
> 
> Firstly, testing doesn't provide any evidence that code is reliable. We've had bugs in hotspot for decades, and the code has appeared to work until one day it mysteriously "stopped working". Some of them involve races just like this one, and I recently had to fix a really nasty one.
> 
> Some definitions
> ----------------
> 
> _Undefined behavior (UB): Renders the entire program meaningless if certain rules of the language are violated. There are no restrictions on the behavior of the program. Examples of undefined behavior are data races... the compiled program is not required to do anything meaningful._ 
> 
> _Threads and data races: When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless [certain conditions]_
> 
> _If a data race occurs, the behavior of the program is undefined._ 
> 
> While there are many instances of what is technically UB in hotspot, this is an acknowledged problem. We try to get rid of it whenever we can, and the UB that remains has to be carefully analysed and justified. We may be able in many cases to use known properties of a system to argue that while something is technically UB according to the language standard, it's effectively implementation defined.
> 
> About this patch
> ----------------
> 
> This patch certainly has a data race, and therefore is undefined. There is nothing on the reading side which helps: while dependency ordering works under certain circumstances on the underlying machine, this is shared code, not AArch64-specific, and in any case a C++ complier isn't required to respect dependencies, so any attempt to use dependencies at this level is invalid.
> 
> It don't look to me like this is the only data race in `LogOutputList`. In my opinion, the whole thing really needs more analysis. I think we'd find more bugs.
> 
> As a minimum, I think you need to use a releasing store on the writer side and an acquiring load on the reader side.

Thank you for your guidance and advice. I appreciate the time and effort you took to share your knowledge with me. And I have modified the code according to your suggestion, please take a look.
So far I have only spotted this data race, and it seems atomic operation on the writing and reading side could provide a fix. Moreover, data races related with LogOutputList are likely to cause more potential bugs, which deserves deeper insight. We will try to do more analysis about it later.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13421#discussion_r1176059783


More information about the hotspot-runtime-dev mailing list