RFR: 8305819: LogConfigurationTest intermittently fails on AArch64

Sat May 6 09:40:19 UTC 2023

On Sat, 6 May 2023 04:40:59 GMT, Dan Gao <duke at openjdk.org> wrote:

>> From `logConfiguration.cpp`
>> 
>> // MT-SAFETY
>> //
>> // The ConfigurationLock guarantees that only one thread is performing reconfiguration. This function still needs
>> // to be MT-safe because logsites in other threads may be executing in parallel. Reconfiguration means unified
>> // logging allows users to dynamically change tags and decorators of a log output via DCMD(logDiagnosticCommand.hpp).
>> //
>> // A RCU-style synchronization 'wait_until_no_readers()' is used inside of 'ts->set_output_level(output, level)'
>> // if a setting has changed. It guarantees that all logs, either synchronous writes or enqueuing to the async buffer
>> // see the new tags and decorators. It's worth noting that the synchronization occurs even if the level does not change.
>> //
>> // LogDecorator is a set of decorators represented in a uint. ts->update_decorators(decorators) is a union of the
>> // current decorators and new_decorators. It's safe to do output->set_decorators(decorators) because new_decorators
>> // is a subset of relevant tagsets decorators. After updating output's decorators, it is still safe to shrink all
>> // decorators of tagsets.
>> //
>> 
>> 
>> So this code is very much expected to be MT-safe. If we have discovered a situation where it is not MT-safe then that needs to be fixed within the existing design involving the `ConfigurationLock` (a semaphore) and the "RCU-style synchronization". I would have to be most suspicious of the latter in terms of missing memory barriers (especially given [JDK-8288904](https://bugs.openjdk.org/browse/JDK-8288904)) , but I don't understand the complete code enough to identify where a problem may be creeping in.
>> 
>> @jdksjolen can you take another look at this and see where the design-flaw, or implementation bug, may be?
>> 
>> Perhaps @navyxliu can offer an opinion too?
>> 
>> Thanks.
>
> Hi, is the current solution okay with you? @dholmes-ora

> @gaogao-mem
> 
> > There is a data race between the writer and reader(s). The writer step (3) and (4) modify the _level_start array and _next of a old node, and the reader step (1) reads from _level_start and _next of a node. This causes a UB. And in current problem, the reader step (2) may read an incorrect _value from a new node.
> 
> This is UB, but the reason isn't the data race.

I disagree. The data race  _is_ the reason. A compiler is allowed to move loads because a data race is UB.

> It's UB because compiler optimizations may...

...do absolutely anything. Once you have undefined behaviour, the entire program is undefined.
The C++ standard doesn't deal with compilers or optimizations, When a program is undefined, it is undefined whatever a compiler does. 

It's  not possible to negotiate with a compiler around UB.

@gaogao-mem  is correct.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13421#issuecomment-1537102107