RFR: 8293252: Shenandoah: ThreadMXBean synchronizer tests crash with IU+aggressive mode

Thu Sep 15 23:16:42 UTC 2022

On Thu, 15 Sep 2022 19:55:30 GMT, Ashutosh Mehra <duke at openjdk.org> wrote:

>> I'm worried the assertion is correct and the thread dump is doing something wrong. Why is it writing an oop? Should it go through the load reference barrier first and evacuate the object?
>
> That was my first impression as well - that we are missing load reference barrier somewhere. But then I realized we do hit the LRB when the JVM tires to load oop from the OopHandle using `OopHandle::resolve`, as is done here [0]
> 
> That prompted me to think there is nothing wrong in `VM_ThreadDump` and the problem lies in the assert conditions, more so because `oop_store_in_heap()` has that additional condition of `is_concurrent_mark_in_progress()`.
> 
> With the additional condition I added in this PR, I am currently hitting another assert with the same test case in the same code area. The sequence of events leading to the new assertion failure are:
> 
> T1: GC cycle started and marked object A
> T2: "Concurrent strong root" is completed which evacuates and updates the non-heap strong roots
> T3: Concurrent evacuation started
> T4: VMThread running VM_ThreadDump creates a new non-heap strong root R to object A (by creating OopHandle)
> T5: VMThread completes VM_ThreadDump operation
> T4: Concurrent evacuation moves object A
> T5: GC cycle completes
> T6: New GC cycle starts (in my case it was because the test was running with "aggressive" heuristics)
> T7: Concurrent root marking comes across R and crashes trying to access A which has been evacuated
> 
> Initially I was thinking of resolving it by moving "Concurrent strong root" phase after "Concurrent evacuation" so that any new non-heap strong roots which may have been created during evacuation are also properly updated.
> 
> But now I am skeptic. It seems the underlying problem for both these asserts is the same - the object was not evacuated when it was handed over to VM_ThreadDump.
> 
> `VM_ThreadDump` calls `HeapInspection::find_instances_at_safepoint` to find instances of a particular class at safepoint which uses GC API `object_iterate()`. In case of Shenandoah GC core of this API is in `ObjectIterateScanRootClosure::do_oop_work()` [1] which does not have the code to evacuate the object. So probably the right way to fix these issues is to add a call to `_heap->evacuate_object()` in `ObjectIterateScanRootClosure::do_oop_work`. what do you think?
> 
> [0] https://github.com/openjdk/jdk/blob/6beeb8471ccf140e59864d7f983e1d9ad741309c/src/hotspot/share/services/management.cpp#L1288
> 
> [1] https://github.com/openjdk/jdk/blob/6beeb8471ccf140e59864d7f983e1d9ad741309c/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L1233

Hmm, `OopHandle::resolve` should have gone through the LRB, which should have evacuated the object. If the object is in the `cset` and it didn't get evacuated by the LRB, then it should be true that the gc is cancelled (out-of-memory during evac). Did this happen under low memory conditions? I wonder if the VM thread did _try_ to evacuate it and fail?

-------------

PR: https://git.openjdk.org/jdk/pull/10268