RFR: 8069367: assert(_nextMarkBitMap->isMarked((HeapWord*) obj)) failed

Mon Mar 9 11:49:36 UTC 2015

Hi Kim,

On 2015-03-06 19:10, Kim Barrett wrote:
> Please review this change to fix a problem in the interaction between
> G1 concurrent marking and eager reclaim of humongous objects.
>
> I will need a sponsor for this change.
>
> The scenario we are dealing with is
>
> (1) A humongous object H is marked and added to the mark stack.
>
> (2) An evacuation pause determines H is no longer live, and reclaims
> it.  This occurs before concurrent marking has gotten around to
> processing the mark stack entry for H.
>
> (3) Concurrent marking processes the mark stack entry for H,
> attempting to scan the now dead object.
>
> The approach being taken to solve this is to check each mark stack
> entry as it is about to be scanned, to filter out and discard stale
> entries for dead humongous objects.
>
> The filter being used tests whether the entry appears to be for an
> object that was allocated during the concurrent mark cycle, by
> comparing the "object" against the associated region's
> top-at-mark-start (TAMS) value.
>
> Normal marking filters out such recent objects and doesn't mark them
> because G1 allocates black, so there is no need to scan such objects.
> As a result, there normally aren't any such objects in the mark stack.
>
> When a humongous object is eagerly reclaimed, the associated start
> region has its TAMS reset to the region bottom.  Even if the region is
> later (during the same concurrent mark cycle) reallocated, its TAMS
> value will remain fixed at region bottom.
>
> As a result, a mark stack entry not below the containing region's TAMS
> must be a stale entry for a reclaimed humongous object.
>
> Note that automated regression testing for this problem is hard; even
> a stress test with a high rate of humongous object allocation and
> discard can take a long time to trip over this situation. Manual
> stress testing with additional VM instrumentation has verified the
> occurrence of the described scenario.
>
> The additional test in concurrent marking imposes a small performance
> degradation on concurrent marking.  Measurements of a program which
> allocates a substantial number of objects and then does nothing but
> repeatedly GC show a fraction of a percent increase in concurrent
> mark time, which is well within the variance for even this contrived
> test.  Aurora performance comparison showed no significant negative
> impact.  Alternatives that preclean the mark stack when humongous
> objects are reclaimed get complicated when attempting to do so without
> extending the reclaiming evacuation pause.

Thanks for providing such a detailed description of the problem and
solution!
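
To check that I read the filter correctly, here is a minimal sketch of
it as a toy model. Everything in it (Region, Entry, isStale, drain, and
the addresses) is hypothetical and made up for illustration; it is not
the code in the webrev. It only assumes what the description states: a
mark stack entry that is not below its region's TAMS is treated as
stale, and eager reclaim resets TAMS to the region bottom.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: hypothetical names, not HotSpot code.
public class StaleEntryFilterSketch {
    // A heap region with a bottom address and a top-at-mark-start (TAMS).
    static final class Region {
        final long bottom;
        long tams;
        Region(long bottom, long tams) { this.bottom = bottom; this.tams = tams; }
        // Eager reclaim resets TAMS to the region bottom.
        void eagerReclaim() { tams = bottom; }
    }

    // A mark stack entry: an object address plus its containing region.
    static final class Entry {
        final long addr;
        final Region region;
        Entry(long addr, Region region) { this.addr = addr; this.region = region; }
    }

    // An entry that is not below TAMS is considered stale.
    static boolean isStale(Entry e) { return e.addr >= e.region.tams; }

    // Drain the mark stack, discarding stale entries; returns how many
    // entries would actually have been scanned.
    static int drain(Deque<Entry> markStack) {
        int scanned = 0;
        while (!markStack.isEmpty()) {
            Entry e = markStack.pop();
            if (isStale(e)) continue; // discard instead of scanning
            scanned++;
        }
        return scanned;
    }

    public static void main(String[] args) {
        Region humongous = new Region(0x1000, 0x5000);
        Region normal = new Region(0x8000, 0x9000);
        Deque<Entry> markStack = new ArrayDeque<>();
        markStack.push(new Entry(0x1000, humongous)); // H starts at region bottom
        markStack.push(new Entry(0x8100, normal));
        humongous.eagerReclaim(); // evacuation pause reclaims H
        System.out.println(drain(markStack)); // prints 1
    }
}
```

In this model the entry for H is only discarded because the reclaim
moved TAMS down to bottom; before the reclaim the same entry would have
been scanned normally.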

One question. I assume that this situation can only occur if the
humongous object was live before the marking started (otherwise it would
already have been filtered out, since its region would have TAMS ==
bottom) and someone removed the reference to the humongous object while
we were marking.

Here's an attempt to show what I mean in a diagram:

H = new Humongous();
A.h = H;
<G1 initial mark>
<Marking scans A and pushes H on the mark stack>
A.h = null;
<G1 young GC>
<H is reclaimed since no one references it>
<Marking continues and finds H on the mark stack>

Is this what is happening? In that case, isn't this violating the SATB 
invariant that anything that was live when marking started is considered 
live when it ends? Your fix will make sure the marking doesn't crash, 
but doesn't this behavior (even prior to your fix) cause other problems?
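
For reference, the invariant I have in mind is the one the SATB
pre-write barrier exists to maintain: while marking is active, the old
value of an overwritten reference field gets logged so that it is still
treated as live. A toy sketch of that idea follows. The names (Obj,
storeH, satbQueue, markingActive) are hypothetical and this is not the
actual G1 barrier code, just the sequence from my diagram.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: hypothetical names, not the G1 barrier implementation.
public class SatbBarrierSketch {
    static final class Obj {
        final String name;
        Obj h; // the single reference field used in the diagram
        Obj(String name) { this.name = name; }
    }

    static boolean markingActive = false;
    static final Deque<Obj> satbQueue = new ArrayDeque<>();

    // Store to holder.h with a SATB-style pre-write barrier: while marking
    // is active, log the non-null value that is about to be overwritten.
    static void storeH(Obj holder, Obj newValue) {
        Obj old = holder.h;
        if (markingActive && old != null) {
            satbQueue.add(old);
        }
        holder.h = newValue;
    }

    public static void main(String[] args) {
        Obj a = new Obj("A");
        Obj h = new Obj("H");
        storeH(a, h);          // A.h = H, before marking: nothing logged
        markingActive = true;  // <G1 initial mark>
        storeH(a, null);       // A.h = null, during marking: H is logged
        System.out.println(satbQueue.peek().name); // prints H
    }
}
```

In this model H ends up in the SATB queue, so I would expect marking to
treat it as live for the rest of the cycle, which is why the reclaim in
the diagram looks surprising to me.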

Thanks,
Bengt

>
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8069367
>
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8069367/webrev.00/
>
> Testing:
> JPRT, Aurora G1 performance test, Aurora Ad-hoc GC Nightly, hand testing
>