RFR: 8293252: Shenandoah: ThreadMXBean synchronizer tests crash with IU+aggressive mode

Fri Sep 16 01:35:49 UTC 2022

On Thu, 15 Sep 2022 23:14:31 GMT, William Kemper <wkemper at openjdk.org> wrote:

>> That was my first impression as well - that we are missing load reference barrier somewhere. But then I realized we do hit the LRB when the JVM tires to load oop from the OopHandle using `OopHandle::resolve`, as is done here [0]
>> 
>> That prompted me to think there is nothing wrong in `VM_ThreadDump` and the problem lies in the assert conditions, more so because `oop_store_in_heap()` has that additional condition of `is_concurrent_mark_in_progress()`.
>> 
>> With the additional condition I added in this PR, I am currently hitting another assert with the same test case in the same code area. The sequence of events leading to the new assertion failure are:
>> 
>> T1: GC cycle started and marked object A
>> T2: "Concurrent strong root" is completed which evacuates and updates the non-heap strong roots
>> T3: Concurrent evacuation started
>> T4: VMThread running VM_ThreadDump creates a new non-heap strong root R to object A (by creating OopHandle)
>> T5: VMThread completes VM_ThreadDump operation
>> T4: Concurrent evacuation moves object A
>> T5: GC cycle completes
>> T6: New GC cycle starts (in my case it was because the test was running with "aggressive" heuristics)
>> T7: Concurrent root marking comes across R and crashes trying to access A which has been evacuated
>> 
>> Initially I was thinking of resolving it by moving "Concurrent strong root" phase after "Concurrent evacuation" so that any new non-heap strong roots which may have been created during evacuation are also properly updated.
>> 
>> But now I am skeptic. It seems the underlying problem for both these asserts is the same - the object was not evacuated when it was handed over to VM_ThreadDump.
>> 
>> `VM_ThreadDump` calls `HeapInspection::find_instances_at_safepoint` to find instances of a particular class at safepoint which uses GC API `object_iterate()`. In case of Shenandoah GC core of this API is in `ObjectIterateScanRootClosure::do_oop_work()` [1] which does not have the code to evacuate the object. So probably the right way to fix these issues is to add a call to `_heap->evacuate_object()` in `ObjectIterateScanRootClosure::do_oop_work`. what do you think?
>> 
>> [0] https://github.com/openjdk/jdk/blob/6beeb8471ccf140e59864d7f983e1d9ad741309c/src/hotspot/share/services/management.cpp#L1288
>> 
>> [1] https://github.com/openjdk/jdk/blob/6beeb8471ccf140e59864d7f983e1d9ad741309c/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L1233
>
> Hmm, `OopHandle::resolve` should have gone through the LRB, which should have evacuated the object. If the object is in the `cset` and it didn't get evacuated by the LRB, then it should be true that the gc is cancelled (out-of-memory during evac). Did this happen under low memory conditions? I wonder if the VM thread did _try_ to evacuate it and fail?

If `OopHandle::resolve()` is hit, then the object gets evacuated through LRB, but the assertions happen before the code calls `resolve`.
The assertion `assert_not_in_cset` is hit when `OopHandle` is created (see frame 16 in the stack trace mentioned [here](https://github.com/openjdk/jdk/pull/10268#issuecomment-1247282078))
The new assertion that I mentioned in previous comment also happens before the `OopHandle` is resolved.

-------------

PR: https://git.openjdk.org/jdk/pull/10268