RFR: 8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures [v4]

Mon Aug 22 16:53:08 UTC 2022

On Mon, 22 Aug 2022 13:36:43 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> > In general, I appreciate making these tests more resilient. However, I wonder about such large numbers for retry attempts. Smells like more than just sporadic failures. Are we sure there is no bug which causes more failures? Does it really make sense to have a weak implementation on platforms with such high failure rates?
> 
> I suspect we are dealing with accidental cache line sharing, context switching, or cache capacity limits that "break" LL/SC weak implementations. Backoff and more retries seem to help pull us out of this mess.
> 
> As an alternative, we could provide a whitelist of platforms where weak CAS is guaranteed to succeed. (We would need to check whether, for example, AArch64 LSE atomics provide more resilient progress behavior.) Unfortunately, that would stop us from verifying that some LL/SC implementations _ever_ succeed.

My concern is that, with so many retries, we may no longer notice implementation problems. Accidental cache line sharing should be fixed in the tests themselves if possible. Context switching or cache capacity limits may cause 1 failure, not 100. What do you think?
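
For illustration, here is a minimal sketch of a bounded weak CAS retry loop that reports how many attempts were needed, so a test could still flag an unexpectedly high failure count instead of silently absorbing it. The class name, field, and helper method are hypothetical and not taken from the PR; the `Thread.onSpinWait()` call is only an assumed stand-in for whatever backoff the tests actually use.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch, not the code under review.
class WeakCasRetrySketch {
    static final VarHandle VH;
    static {
        try {
            VH = MethodHandles.lookup()
                    .findVarHandle(WeakCasRetrySketch.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    int value;

    // Returns the number of attempts used (1 = succeeded on the first try),
    // or -1 if the weak CAS never succeeded within maxAttempts.
    static int weakCasWithRetries(WeakCasRetrySketch target,
                                  int expected, int update, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // weakCompareAndSet may fail spuriously even when the value matches.
            if (VH.weakCompareAndSet(target, expected, update)) {
                return attempt;
            }
            // Assumed backoff: a spin-wait hint before the next attempt.
            Thread.onSpinWait();
        }
        return -1;
    }
}
```

A test built on such a helper could accept a large retry limit for robustness while still logging or asserting on the returned attempt count, which would keep a platform that routinely needs many retries visible.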

-------------

PR: https://git.openjdk.org/jdk/pull/9889