RFR: 8305819: LogConfigurationTest intermittently fails on AArch64

Xin Liu xliu at openjdk.org
Wed May 3 06:46:17 UTC 2023


On Mon, 1 May 2023 22:44:03 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> src/hotspot/share/logging/logOutputList.cpp line 132:
>> 
>>> 130:   // Prevent the reordering of adding node to list and setting the value of node.
>>> 131:   // Such a reordering would lead to reading incorrect values.
>>> 132:   OrderAccess::release();
>> 
>> I agree this is the spot-on. what we need is a storestore barrier.  
>> 
>> We don't want store 'node->_value = output;' floats below stores, such as store '_level_start[l] = node;'
>> 
>> I meant to ask whether 'storestore' barrier itself can resolve this intermittent error on your side?  I just realized that your had storestore barrier in the early revision and this: 
>> 
>>>  By the way, I have tested millions of times and it hasn't failed when just adding memory barrier on the writing side.
>> 
>> So you have verified it, haven't you?
>> 
>> I can help you fill in some backgrounds about why it's not mandatory to use release-acquire.  this RCU hack starts from Linux kernel...
>> 
>> It doesn't use explicit synchronization for scalability. UL chose to use RCU in this scenario because readers are more frequent and need to be scalable. writer who wants to change LogConfiguration is very low-frequent and singleton. RCU shifts almost all costs on the writer.  when writer is changing, it allows readers to uses either old or new nodes in the linked list.
>
> @navyxliu  thank you for looking at this. As you note the RCU code is intended to ensure that a reader can either see the old node, or the new node, and work correctly. In that case  we would logically expect to use release semantics so that if a reader sees the new node then it also sees the correct values for all fields of the node. The release barrier is a `storeStore|loadStore` barrier. Release semantics then require matching acquire semantics when we read the node from the list. You seem to be suggesting that RCU obviates the need for the acquire side of this - is that correct? If so then perhaps the raw storeStore barrier will suffice - as we state in orderAccess.hpp:
>> If not used in such a pair, it is advised to use a membar instead: acquire/release only make sense as pairs.
> 
> Personally I would find the acquire/release easier to reason about as I'm not that familiar with the subtleties of RCU. That said I'm also concerned that this RCU implementation is again found to be flawed due to missing barriers and have to wonder if this is the last of them? If RCU is too difficult to get right then perhaps we should stop trying to be overly clever and use a simpler, potentially slower, lock-based solution?

@dholmes-ora 
> You seem to be suggesting that RCU obviates the need for the acquire side of this - is that correct? If so then perhaps the raw storeStore barrier will suffice - as we state in orderAccess.hpp

yes, that's my understanding.  Release has compound effect, but I still think we only need a storestore barrier.  In the runtime, there's only one active writer.  Other writers are all blocked by ConfigurationLock.  

Even in weak memory model, we don't need to worry about loadStore order because they are either control dependency (line 8) or true data dependency (line 9). 


  1 void LogOutputList::add_output(LogOutput* output, LogLevelType level) {
  2   LogOutputNode* node = new LogOutputNode();
  3   node->_value = output;
  4   node->_level = level;
  5
  6   // Set the next pointer to the first node of a lower level
  7   for (node->_next = _level_start[level];
  8        node->_next != nullptr && node->_next->_level == level;
  9        node->_next = node->_next->_next) {
 10   }
 11 ...
 12   a storestore barrier
 13   publish 'node'


however, we need a storestore barrier at line 12 on all platforms indeed. Even though x86_64 will never reorder 2 stores, C/C++ compiler could. 

I tried 5000 times on a 64-cores arm64 instance, but I can't trigger this error. 
@gaogao-mem what's your preference?  If a storestore barrier can resolve your error, do we want to go to release/acquire pair? 

>  If RCU is too difficult to get right then perhaps we should stop trying to be overly clever and use a simpler, potentially slower, lock-based solution?

The current solution is there before I added 2 tests in JDK-8267952. The unit tests just simulate 'log dmcd' in real life. I look up JBS, 8305819 is the first record we encounter this error. maybe the new microarchitecture of arm64 reorders memory operations more aggressively than before?   

if we replace the current solution with read/write lock, I guess we won't lose much in terms of performance. but this seems a separated task.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13421#discussion_r1183292263


More information about the hotspot-runtime-dev mailing list