RFR: 8305819: LogConfigurationTest intermittently fails on AArch64

Xin Liu xliu at openjdk.org
Thu May 4 17:43:17 UTC 2023


On Thu, 4 May 2023 09:23:53 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> The PR was never approved and was somewhat contentious. I would want to see how UL uses the proposed RWL before considering the integration of both changes. This is not a trivial amount of work I know.
>> 
>> Buried on the other PR was a suggestion to switch the broken UL RCU mechanism for one of the other VM RCU-like mechanisms, but I'm not familiar enough with those other mechanisms to say if that is viable. But perhaps the UL RCU mechanism can be examined against those other ones to see if any more bugs are evident?
>> 
>> Or we just proceed with a suitably commented storeStore barrier and hope that this is the last RCU bug.
>
>> Or we just proceed with a suitably commented storeStore barrier and hope that this is the last RCU bug.
> 
> We can't do only that, because it's still buggy. For it to be correct portable code we need at least a `memory_order::consume` load on the reading side. Given that `consume` is C++20 and we're still on C++14, something like the kernel's `READ_ONCE` would probably do.

hi, @theRealAph 

I learn your comments and Boehm's. Actually, I found that a discussion on control dependency ordering took place in the PR of JDK-8288904 before. The knowledge is non-trivial but I think I got your point.

IMHO, you insist the general rule to a niche case. you are still right, but now the bar is too high to resolve this immediate error. 

1. compiler may flip orders regardless control dependency -- check! you also makes reference of kernel RCU. However, UL configuration is much simpler than C++ standard libraries and Linux kernel. In this case, I think we should take the advice of "Recipient only reads x" from Boehm.  there's a single writer. it's manipulating linked-list. within a CPU, control and data dependency is guaranteed. 

2. I believe that the error reported in JDK-8305819 is caused by hardware instead of compiler optimization.  Evidence: I was told by @gaogao-mem this intermittent crash disappeared after he integrated a storestore barrier. He has ran the tests millions of times and the error never occurs again.  Can we settle that a store-only barrier can practically resolve this error? 

We also provided reasoning why we need a storestore barrier in prior comments. we will put comments in PR. The process is same as JDK-8288904. Reason, merge and resolve the error practically. 

3. I am not saying it's perfect. I agree the current solution is still "buggy" in presence of compiler optimizations.  we never enforce it to the C++ standard so it's "UB". 

That's why we should work on replacing RCU with RWL. Like David pointed out, we need more time to review, test a solution. To achieve that, we need a better baseline. now we know that the concurrent test is broken on the weak memory model. If this bandaid works, I think we will be in the better position for the development.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13421#discussion_r1185320229


More information about the hotspot-runtime-dev mailing list