RFR: 8300915: G1: incomplete SATB because nmethod entry barriers don't get armed [v4]
Erik Österlund
eosterlund at openjdk.org
Thu Jan 26 10:26:21 UTC 2023
On Wed, 25 Jan 2023 21:20:40 GMT, Richard Reingruber <rrich at openjdk.org> wrote:
>> The change makes sure all nmethod entry barriers are armed when a G1 concurrent marking cycle starts by doing it unconditionally in `G1ConcurrentMark::pre_concurrent_start()`. The barrier is needed to add oop constants to the SATB.
>>
>> This alone would be a conservative/minimal fix but I didn't like that `CodeCache::is_gc_marking_cycle_active()` can return true after a marking cycle start is undone. I.e. `G1ConcurrentMarkThread::_state == Idle && CodeCache::is_gc_marking_cycle_active()` is possible.
>> Edit: It felt like an inconsistency but maybe its actually ok not to finish the CC marking cycle when the G1 marking is undone. Please comment...
>> The changes in `G1ConcurrentMark::post_concurrent_undo_start()` were made to avoid it.
>>
>>
>> Testing
>>
>> * Rebased this PR onto [JDK-8297487](https://bugs.openjdk.org/browse/JDK-8297487), replaced related assertions with guarantees, and then executed test/langtools:tier1 in a loop for 12h without seeing the issues reported in [JDK-8299956](https://bugs.openjdk.org/browse/JDK-8299956).
>>
>> * CI testing at SAP: This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le.
>>
>> * GHA: build failures because download of apache ant failed. `serviceability/jvmti/vthread/SuspendResumeAll/SuspendResumeAll.java` had a timeout on windows-x64. Should be unrelated. It succeeded in our CI testing.
>
> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
>
> Introduce G1CollectedHeap::finish_codecache_marking_cycle()
Hmm. Some thoughts. The way I look at it, we have essentially 4 different endings to CM when we poke it to "start". 1) successful finish, 2) undone end (no CM needed in the end), 3) aborted by STW full GC, 4) aborted by new CM request.
If I understand this patch correctly, it tries to recognize case 2 as essentiallly being like 1, in the sense that we have traversed the full heap and we are done. Is that correct?
In that case, I think there is still the same issue for case 4. If a CM is aborted because we are interrupting an old CM in order to start a new CM, then we still have a case where we have an old marking cycle that has finished, but not successfully. We can't call G1CollectedHeap::finish_codecache_marking_cycle() until we have successfully marked through all objects in the heap. If we were to call that when an old CM is aborted, then is_unloading() might start returning true on nmethods that were on-stack in some loom continuation object in the old generation. This was the main point of only conditionally starting cycles - then the epoch counters being marked in the nmethods would be conservative and potentially keep nmethods around a bit longer that really could be discarded. Naturally, what I didn't consider at that time, was that not arming all nmethods at the latest initiating marking operation was an SATB violation, when we rely on nmethod entry barriers enforcing SATB constrain
ts. At the time when that code was written, we thought that G1 didn't need that, but then it turned out that it did.
Having said all that, I think what I'm getting at is that this patch fixes the SATB problem, but introduces a new premature nmethod unloading condition for the loom code instead. It's like we want to have the cake and eat it too.
What would of course be super nice is if we didn't abort CMs to start new CMs. I'm not entirely sure why that's a thing in G1. Anyway, if we want to keep on doing that, then I think what we need to do is to explicitly recognize when case 4 happens, and then walk the code cache, mark all nmethods as maybe_on_stack (because the truth is that we don't really know if the GC never got to finish successfully), and then call G1CollectedHeap::finish_codecache_marking_cycle(). Then we can close the gap and solve both the SATB issue and the loom nmethod unloading problems at the same time.
Hope this makes sense, and feel free to ask if something isn't clear in my description. And thanks for looking into this.
-------------
PR: https://git.openjdk.org/jdk/pull/12194
More information about the hotspot-gc-dev
mailing list