RFR: 8305819: LogConfigurationTest intermittently fails on AArch64

Fri Apr 28 23:40:53 UTC 2023

On Thu, 27 Apr 2023 06:57:51 GMT, gaogao-mem <duke at openjdk.org> wrote:

>> Is the solution as simple as the test needing to use the ConfigurationLock?
>
> @dholmes-ora Thank you for your comment. I agree with your opinion that the API here needs better design and implementation. In my humble opinion, if we need to use locks, I would prefer to use a readwrite lock. Personally, I lean towards a lock-free implementation, but as a beginner, I'm not fully familiar with the overall design and implementation of the UL, so it may require more time. The current crash can be resolved with this patch, so I think maybe we can first apply this patch to solve the problem we are facing now. If possible, I'm willing to propose a more thorough solution in a follow up patch, but it may take some time.

hi, @gaogao-mem 

I am trying to understand your analysis. let me debrief the test first. It would be helpful for others to reason the correctness of your work. 

There are 3 threads involved in `reconfigure_decorators_MT`: 2 readers and 1 writer.  2 readers keep emitting logs. The singler writer updates log configurations. 

    // Take turn logging with different decorators, either None or All.
    set_log_config(TestLogFileName, "logging=debug", "none");
    set_log_config(TestLogFileName, "logging=debug", _all_decorators);

Despite the log tag 'logging=debug' seems same, `LogConfiguration::configure_output` still updates it anyway at `ts->set_output_level(output, level)`. here ts refers to tagset 'logging' and level is debug.

Since the node pre-exists in the output list, hotspot invokes `update_output_level(node, level)`.  Since level is same, it amounts to clone the node in the output list. At the same time, it's possible that 2 readers are seeing the old node. If any reader is in action, _active_readers is greater than 0. 

void LogOutputList::set_output_level(LogOutput* output, LogLevelType level) {
  LogOutputNode* node = find(output);
  if (level == LogLevel::Off && node != nullptr) {
    remove_output(node);
  } else if (level != LogLevel::Off && node == nullptr) {
    add_output(output, level);
  } else if (node != nullptr) {
    update_output_level(node, level);
  }
}

update_output_level() is very subtle but it follow the RCU idiom. The basic ideas is to add a new node, wait for quiescent state and purge the deprecated one.

  1 void LogOutputList::update_output_level(LogOutputList::LogOutputNode* node, LogLevelType level) {
  2   add_output(node->_value, level);
  3   wait_until_no_readers();
  4   remove_output(node);
  5 }
``` 

> During this time, the _output_list will be modified. Because of the relax memory model of aarch64, while adding LogOutputNode to LogOutputList, adding node to list and setting the value of node may be reordered, therefore the read thread may not read the correct value of the node's content. Consequently, storestore memory barrier is needed to ensure the order of writing.

I agree on this. x86 will never reorder stores, but aarch64 could. We can't publish premature node no matter what. 

In between 2 and 3, readers may see either the old or the new node by design. it assumes that load/store of an aligned pointer is atomic. I can't follow your change in LogOutputList::Iterator. I don't think we need acquire for _level_start[level] in reader.  also I don't understand why you have to add volatile qualifier for those fields. 

**To make RCU work correctly, we  must ensure that readers only see the new node after line 3.**  In wait_until_no_readers(), there's a storeload() barrier. I think it enforces that.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13421#issuecomment-1528188010