RFR: 8300915: G1: incomplete SATB because nmethod entry barriers don't get armed [v4]

Fri Jan 27 09:10:19 UTC 2023

On Thu, 26 Jan 2023 16:04:19 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>>> > I do think the problem exists with case 2) undone end (no CM needed in the end).
>>> > `CodeCache::_gc_epoch` is incremented before the YC and with the current version of the pr again when the CM start is undone after YC at the same safepoint. This can happen a couple of times and is_unloading() might start returning true on nmethods in loom StackChunks. That would be not good as they do have frames.
>>> 
>>> Isn't it the case though that the CM is undone because the YC already visited all the objects? Then when the _gc_epoch is 
>> 
>> The reason why the CM is undone because the heap occupancy after gc is lower than the threshold to start the initial marking at the end of gc.
>> The undo has nothing to do with having visited a certain set of objects.
>> 
>>> incremented at the end where we call the undo hook, the implication is that we have finished the full collection, and hence nmethods that haven't been marked as maybe on-stack, can't be on-stack (virtually or physically). That assumption seems to hold true, since the YC traversed all the stack chunks in the young generation, and there aren't any in the old generation. Or is there a situation where we undo, even though there is more than one object in the old generation that we have not visited?
>> 
>> The undo does not take (and never took) whether objects have been visited or not or whether there are stack chunks in the old generation into account. It makes that decision purely on heap occupancy.
>
>> > incremented at the end where we call the undo hook, the implication is that we have finished the full collection, and hence nmethods that haven't been marked as maybe on-stack, can't be on-stack (virtually or physically). That assumption seems to hold true, since the YC traversed all the stack chunks in the young generation, and there aren't any in the old generation. Or is there a situation where we undo, even though there is more than one object in the old generation that we have not visited?
>> 
>> The undo does not take (and never took) whether objects have been visited or not or whether there are stack chunks in the old generation into account. It makes that decision purely on heap occupancy.
> 
> So things are broken :(
> 
> (Random 1min idea coming up....):
> 
> One option could be decreasing the epoch again, but that will keep alive already in this concurrent start tagged objects.
> What about having a separate "unloading-epoch" and "next-epoch"?
> 
> When determining whether an object can be unloaded, use the unloading epoch, that can be decremented again at the end of marking, while the next-epoch (that is always incremented) is used next time we start the marking for tagging. If we complete the marking, set the unloading-epoch to next-epoch.
> 
> E.g.
> mark start:
>   unloading-epoch = 1
>   next-epoch = 1
> 
>   tag all nmethods with next-epoch ( = 1)
> 
> case 1:
>   we decide to abort the marking
>   decrement unloading-epoch (= 0)
>   increment next-epoch (= 2)
> 
> (since we compare nmethods with unloading-epoch (=0) everything is alive)
> 
> [later]
> 
> mark start
>   unloading-epoch = 0
>   next-epoch = 2
> 
>   tag all nmethods with next-epoch (= 2)
> 
> we finish the marking
>   set unloading epoch to next-epoch (=2)
>   increment next-epoch (=3)
> 
> and so on. This handles an infinite amount of concurrent undos I think.
> 
> As mentioned, a one-minute idea..., not sure it works or is useful at all.

Thanks for the feedback @tschatzl and @fisk !

I've changed the commenting again but didn't push. I thought it should be explained right at the definition of `CodeCache::on_gc_marking_cycle_finish()` why the code cache cycle must not be finished if the heap marking was not finished. I understood it immediately when @fisk reminded me but overlooked it reading code even though I kinda new it.

I've tried to incorporate @tschatzl last comments too.

Last but not least: the current version still fixes the issues after https://bugs.openjdk.org/browse/JDK-8300915

Thanks, Richard.

-------------

PR: https://git.openjdk.org/jdk/pull/12194