RFR: 8305903: Deflate monitors of dead objects before they become unreachable

Wed May 3 08:11:15 UTC 2023

On Wed, 3 May 2023 05:34:50 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

>> With compact object headers ([JDK-8305895](https://bugs.openjdk.org/browse/JDK-8305895)), I've seen occasional failures in G1, where the refinement thread tries to parse a heap region that has dead objects, and would sometimes see an object with a monitor that has already been deflated. And because deflation does not bother to restore the header of dead objects, when heap iteration tries to load the Klass* of the dead object, it would reach to unknown memory and crash.
>> 
>> In OM::deflate_monitor() we check object_peek(), and if that returns null, then the object header is not updated (and can't be, because the object cannot be reached anymore). Concurrent GCs that process weak handles concurrently ensure that the object doesn't leak out by returning null there (via a barrier). However, for runtime code, at this point, there is no safe way to grab the object and update the header, because the GC might already have reclaimed it. The last safe point in time where we can do that is in WeakProcessor::Task::work() and OopStorage::weak_oops_do() itself, as soon as we detect that the object is properly unreachable.
>> 
>> The fix is to restore the header of dead objects just before they become unreachable. This can be done in the closures used by WeakProcessor::weak_oops_do(), right before the weak root will be cleared.
>> 
>> Notice that this is only a bug with compact object headers. It doesn't hurt to fix this in general, though.
>> 
>> Testing:
>>  - [x] tier1
>>  - [x] tier2
>
> In particular, during concurrent refinement, we're looking at either parsable
> or unparsable parts of a region when processig a card. In the unparsable part,
> we don't look at the dead objects for size information to find object
> boundaries. Instead we use the mark bits to find live objects, ignoring dead
> objects completely. In the parsable part, the dead objects have been
> overwritten with filler objects that it is safe to examine. The dead objects
> are replaced by fillers concurrently, moving the parsable boundry along the
> way. (The replacement by filler objects is an optimization to make card
> scanning faster, since no bitmap searching is required to step over one.) At
> least, that's all how I think it's supposed to work; it's been a while since I
> delved deeply into that code and it's changed somewhat since I last did so.
> 
> So I want to better understand the failures being reported, because what's
> being described doesn't seem like it should happen and may indicate a bug
> elsewhere.

G1 should never scan headers of dead objects, and if so, this is a bug somewhere else. As @kimbarrett mentioned, it is better to find and fix the bug and not paper over it.

https://tschatzl.github.io/2022/08/04/concurrent-marking.html describes the interaction between refinement and class unloading somewhat.

  1) Initially all objects are live and can be walked
  2) G1 decides to unload some classes. From this point on, for all old regions, below TAMS (copied to something we call "parsable bottom", i.e. PB), refinement code exclusively uses the bitmap from the marking for identifying live objects in that area. Objects above PB are all live, and so are parsable always.
  3) Marking threads now scrub the area between a region's bottom and PB, i.e. they put filler objects spanning all dead objects. This makes the heap parsable again.
  4) After finishing that scrubbing for a region, PB is set to bottom. Refinement can walk the heap region completely again - dead objects have been replaced with filler objects.

So in theory refinement threads should never see any dead object. The only issue I can think of may be problems with synchronizing reset of PB with refinement threads that might cause them reading dead objects (i.e. writing of dead objects not completed while PB has been reset).

The matter is different if somebody else keeps references to dead objects and tries to modify headers/contents afterwards; one cause could be that monitor handling modifies headers of dead objects, i.e. what has been replaced by filler objects. Then if that overwrite replaces a filler object header, refinement would obviously happily try to parse that object.

Do you have more information what GC is doing when these errors happen? If between Remark and Cleanup pause, then there might be a problem with that synchronization. If outside, then it's much more likely some change modifying dead areas of the heap (which seems to be an obvious no-can-do to me). In any case please provide more information about these crashes.

Also I do not understand why this would be "only a bug with compact object headers" - can you elaborate? All of the causes for refinement seeing a dead object seem to be quite independent of compact objects to my understanding.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13721#issuecomment-1532613942