RFR: 8373116: Genshen: arraycopy_work should be done unconditionally by arraycopy_marking if the array is in an old region [v5]

Fri Dec 5 23:01:57 UTC 2025

On Fri, 5 Dec 2025 20:00:04 GMT, William Kemper <wkemper at openjdk.org> wrote:

> The issue, as I understand it, is that mutators are racing with the concurrent remembered set scan. If a mutator changes a pointer covered by a dirty card, it could prevent the remembered set scan from tracing the original object that was reachable at the beginning of marking. Since we may not be marking old, we cannot rely on the TAMS for objects in old regions and must unconditionally enqueue all of the overwritten pointers in the old array. Should we only do this when young marking is in progress? Perhaps we should have a version of `arraycopy_work` that only enqueues young pointers here?

I don't think it is related to any race on the remembered set. I got some GC logs that I think show how it actually happens.

[15.653s][info][gc,start       ] GC(188) Pause Full
...
[15.763s][info][gc             ] GC(188) Pause Full 913M->707M(1024M) 109.213ms
[15.767s][info][gc,ergo        ] GC(189) Start GC cycle (Young)
...
[15.802s][info][gc             ] GC(189) Concurrent reset after collect (Young) 1.160ms
[15.802s][info][gc,ergo        ] GC(189) At end of Interrupted Concurrent Young GC: Young generation used: 874M, used regions: 874M, humongous waste: 7066K, soft capacity: 1024M, max capacity: 1022M, available: 99071K
[15.802s][info][gc,ergo        ] GC(189) At end of Interrupted Concurrent Young GC: Old generation used: 1273K, used regions: 1536K, humongous waste: 0B, soft capacity: 1024M, max capacity: 1536K, available: 262K
[15.803s][info][gc,metaspace   ] GC(189) Metaspace: 759K(960K)->759K(960K) NonClass: 721K(832K)->721K(832K) Class: 38K(128K)->38K(128K)
[15.803s][info][gc             ] Trigger (Young): Handle Allocation Failure
[15.803s][info][gc,start       ] GC(190) Pause Full
[15.803s][info][gc,task        ] GC(190) Using 8 of 8 workers for full gc
[15.803s][info][gc,phases,start] GC(190) Phase 1: Mark live objects
[15.806s][info][gc,ref         ] GC(190) Clearing All SoftReferences
<crash happened in full GC>

1. 188 was a full GC. After 188, the TAMS of every region, including all old regions, is reset to bottom in the ShenandoahPostCompactClosure.
2. 189 was a concurrent young GC. An array copy barrier was executed for an array stored in old, but because TAMS was at the bottom as a result of the 188 full GC, the barrier was a no-op due to the bug.
3. 189 finished marking and reclaimed the garbage, but because the array copy barrier was a no-op, the copy was not marked live and was reclaimed.
4. Allocation failure cancelled 189 and escalated to a full GC again, but by then the heap was already corrupted and the VM was going to crash.
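
To make step 2 concrete, here is a minimal toy model of the TAMS check, not the actual HotSpot code. The `Region` struct and the functions `barrier_runs_buggy` and `barrier_runs_fixed` are hypothetical names invented for illustration, and the "fixed" rule is only my reading of the change proposed in the PR title (do the work unconditionally when the array is in an old region).

```cpp
#include <cassert>
#include <cstddef>

// Toy model only. None of these names exist in HotSpot.
struct Region {
  bool is_old;         // true if the region belongs to the old generation
  std::size_t bottom;  // first address of the region
  std::size_t tams;    // top-at-mark-start; equals bottom after the full GC reset
};

// Assumed buggy rule: the barrier enqueues the overwritten pointers only
// when the array lies below TAMS. With TAMS == bottom, no address in the
// region is below TAMS, so the barrier never runs.
bool barrier_runs_buggy(const Region& r, std::size_t array_addr) {
  return array_addr < r.tams;
}

// Assumed fixed rule: an array in an old region always gets the barrier,
// because TAMS of an old region is not meaningful while only young is marked.
bool barrier_runs_fixed(const Region& r, std::size_t array_addr) {
  if (r.is_old) {
    return true;
  }
  return array_addr < r.tams;
}
```

With an old region whose TAMS was reset to bottom (say `Region{true, 1000, 1000}`) and an array at address 1040, the buggy rule skips the barrier while the fixed rule runs it. Young regions keep the usual below-TAMS check in both versions.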

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28669#issuecomment-3618896157