RFR: 8293252: Shenandoah: ThreadMXBean synchronizer tests crash with IU+aggressive mode

Thu Sep 15 19:57:44 UTC 2022

On Wed, 14 Sep 2022 22:52:12 GMT, William Kemper <wkemper at openjdk.org> wrote:

>> Are you suggesting we are better off removing this assertion completely, because the assertion that the `value` is not in collection set is also not correct as is evident from this issue.
>
> I'm worried the assertion is correct and the thread dump is doing something wrong. Why is it writing an oop? Should it go through the load reference barrier first and evacuate the object?

That was my first impression as well - that we are missing load reference barrier somewhere. But then I realized we do hit the LRB when the JVM tires to load oop from the OopHandle using `OopHandle::resolve`, as is done here [0]

That prompted me to think there is nothing wrong in `VM_ThreadDump` and the problem lies in the assert conditions, more so because `oop_store_in_heap()` has that additional condition of `is_concurrent_mark_in_progress()`.

With the additional condition I added in this PR, I am currently hitting another assert with the same test case in the same code area. The sequence of events leading to the new assertion failure are:

T1: GC cycle started and marked object A
T2: "Concurrent strong root" is completed which evacuates and updates the non-heap strong roots
T3: Concurrent evacuation started
T4: VMThread running VM_ThreadDump creates a new non-heap strong root R to object A (by creating OopHandle)
T5: VMThread completes VM_ThreadDump operation
T4: Concurrent evacuation moves object A
T5: GC cycle completes
T6: New GC cycle starts (in my case it was because the test was running with "aggressive" heuristics)
T7: Concurrent root marking comes across R and crashes trying to access A which has been evacuated

Initially I was thinking of resolving it by moving "Concurrent strong root" phase after "Concurrent evacuation" so that any new non-heap strong roots which may have been created during evacuation are also properly updated.

But now I am skeptic. It seems the underlying problem for both these asserts is the same - the object was not evacuated when it was handed over to VM_ThreadDump.

`VM_ThreadDump` calls `HeapInspection::find_instances_at_safepoint` to find instances of a particular class at safepoint which uses GC API `object_iterate()`. In case of Shenandoah GC core of this API is in `ObjectIterateScanRootClosure::do_oop_work()` [1] which does not have the code to evacuate the object. So probably the right way to fix these issues is to add a call to `_heap->evacuate_object()` in `ObjectIterateScanRootClosure::do_oop_work`. what do you think?

[0] https://github.com/openjdk/jdk/blob/6beeb8471ccf140e59864d7f983e1d9ad741309c/src/hotspot/share/services/management.cpp#L1288

[1] https://github.com/openjdk/jdk/blob/6beeb8471ccf140e59864d7f983e1d9ad741309c/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L1233

-------------

PR: https://git.openjdk.org/jdk/pull/10268