RFR: 8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures [v4]

Mon Aug 22 17:10:38 UTC 2022

On Mon, 22 Aug 2022 16:49:34 GMT, Martin Doerr <mdoerr at openjdk.org> wrote:

> My concern is that we may not notice implementation problems any more when retrying so often. Accidental cache line sharing should better get fixed in the tests if possible. Context switching or cache capacity limits may cause 1 failure, not 100. What do you think?

I would not dare to say that 100 retries is going to be enough for everything; I would not even dare to say that it is "too many": my test experiments ran with tens of thousands of retries before.

Here is the thing, though: on weak LL/SC hardware, we would enter the same kind of (but unbounded!) retry loop inside the strong CAS implementation. If the implementation is laggy, we would spin there a lot. In fact, I don't think anyone has actually measured how many retries we need to succeed at strong CAS on such platforms. But as long as such a CAS eventually succeeds, it looks to be a performance question, not a functionality one. The tests affected by this PR are testing the functional part: weak CAS _eventually_ succeeds after aggressive retries/backoffs.
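To make the comparison concrete, here is a minimal sketch of the two loops, assuming a plain `int` field accessed through a `VarHandle`. This is not the code from the PR: the class name `WeakCasRetrySketch`, the constant `WEAK_ATTEMPTS`, and both helper methods are hypothetical names chosen for illustration, and the backoff is just `Thread.onSpinWait()`.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class WeakCasRetrySketch {
    // Hypothetical retry budget; the discussion above is about whether 100 is reasonable.
    static final int WEAK_ATTEMPTS = 100;

    static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(WeakCasRetrySketch.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    int value;

    // Bounded loop, as a test would use: weak CAS may fail spuriously even when
    // the expected value matches, so retry a limited number of times.
    static boolean weakCasWithRetries(WeakCasRetrySketch h, int expected, int update) {
        for (int attempt = 0; attempt < WEAK_ATTEMPTS; attempt++) {
            if (VALUE.weakCompareAndSet(h, expected, update)) {
                return true;
            }
            Thread.onSpinWait(); // simple backoff hint between attempts
        }
        return false;
    }

    // Unbounded loop, roughly what a strong CAS amounts to on LL/SC hardware:
    // keep retrying until the update lands or the value is observed to differ.
    static boolean strongCasViaWeak(WeakCasRetrySketch h, int expected, int update) {
        while (true) {
            if (VALUE.weakCompareAndSet(h, expected, update)) {
                return true;
            }
            if ((int) VALUE.getVolatile(h) != expected) {
                return false; // genuine mismatch, not a spurious failure
            }
        }
    }

    public static void main(String[] args) {
        WeakCasRetrySketch h = new WeakCasRetrySketch();
        h.value = 1;
        System.out.println("bounded weak CAS 1 -> 2: " + weakCasWithRetries(h, 1, 2));
        System.out.println("emulated strong CAS 2 -> 3: " + strongCasViaWeak(h, 2, 3));
        System.out.println("emulated strong CAS with stale expected value: " + strongCasViaWeak(h, 2, 4));
    }
}
```

The only structural difference between the two helpers is the loop bound: the bounded one can report a failure that is purely spurious, while the unbounded one turns the same slowness into extra spinning.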

The failing platforms I have (RISC-V) are remarkably slow, which makes it hard to dissect what might be going on there. We can dive in, for sure, but I reckon that would take weeks to resolve. It can wait, if you feel strongly about it.

-------------

PR: https://git.openjdk.org/jdk/pull/9889