RFR: 8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures [v7]
Aleksey Shipilev
shade at openjdk.org
Tue Aug 30 14:25:56 UTC 2022
On Tue, 30 Aug 2022 14:18:11 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> We have a few reports that existing Weak* VarHandle tests are still flaky, for example on large AArch64 machines or small RISC-V machines.
>>
>> The flakiness is intrinsic to the nature of Weak* operations under tests, that can spuriously fail. The last attempt to fix these was [JDK-8155739](https://bugs.openjdk.org/browse/JDK-8155739). We need to strengthen these a bit more.
>>
>> The actual values depend on the successful testing on known-failing platforms. I ballparked bumping the attempts 5x and introducing the delay would help without exploding test time in worst cases.
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>
> - Rewind back to 100 attempts, 1ms delay
> - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
> - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
> - Pull Handles.get out of the weak retry loop
> - Drop weakDelay to 1
> - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
> - Rework timeouts
> - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
> - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
> - Update copyrights
> - ... and 2 more: https://git.openjdk.org/jdk/compare/b3450e93...fd6aa17b
So, I have been playing with a HiFive Unmatched board, and I can see that even in single-threaded mode weak CASes can spuriously fail when JIT compiler threads are very active at the same time. I suspect L1-tagging LR/SC fails just because we context-switch out a lot.
On HiFive Unmatched, without any delay, it is common to see long-tailed attempt counts that exceed 200 attempts. Adding the 1ms delay drops that long tail to fit under 10 attempts; I suspect this is because it provides a natural backoff on an over-subscribed system, as the thread comes back later if no resources are currently available. Still, that does not seem to hold when the rest of the system is running other parallel tests. Running `java/lang/invoke/VarHandles` with 50 attempts and 1ms delay still occasionally fails the weak CAS tests.
Running with 100 attempts and 1ms delay seems to pass well (tried several times), which is what the current version does. I think this is as good as it gets for a default mode. I also ran the tests on other platforms, and there spurious failures are either not observed at all (x86 and friends), or we very rarely see 1..2 retries (for example, on AArch64 without LSE). This means the full 100 attempts would almost never be used up on those platforms, and thus we can avoid introducing any platform-specific test config selection.
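For illustration, the retry shape discussed above can be sketched roughly as follows. This is not the code from the PR: the class `WeakCasRetry`, the method `weakCasAttempts`, and the constant names are hypothetical and chosen only for this sketch. Only the values (100 attempts, 1ms delay) come from the discussion above.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch, not the actual test code from the PR.
final class WeakCasRetry {
    // Values taken from the discussion above; the names are assumptions.
    static final int WEAK_ATTEMPTS = 100;
    static final long WEAK_DELAY_MS = 1;

    static final VarHandle VH;
    static {
        try {
            VH = MethodHandles.lookup()
                    .findVarHandle(WeakCasRetry.class, "v", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    int v;

    /**
     * Retries a weak CAS until it succeeds or the attempt budget runs out.
     * Returns the number of attempts used, or -1 if every attempt failed
     * (or the thread was interrupted while backing off).
     */
    static int weakCasAttempts(WeakCasRetry recv, int expected, int updated) {
        for (int c = 1; c <= WEAK_ATTEMPTS; c++) {
            // A weak CAS may fail spuriously even when the value matches.
            if (VH.weakCompareAndSetPlain(recv, expected, updated)) {
                return c;
            }
            try {
                // Back off briefly so an over-subscribed system can settle.
                Thread.sleep(WEAK_DELAY_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

On platforms where spurious failures are rare, a loop of this shape returns after the first attempt or two, so the delay and the larger attempt budget only cost time when the CAS is actually failing.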
-------------
PR: https://git.openjdk.org/jdk/pull/9889
More information about the core-libs-dev mailing list