RFR: 8069367: assert(_nextMarkBitMap->isMarked((HeapWord*) obj)) failed

Fri Mar 6 18:10:07 UTC 2015

Please review this change to fix a problem in the interaction between
G1 concurrent marking and eager reclaim of humongous objects.

I will need a sponsor for this change.

The scenario we are dealing with is:

(1) A humongous object H is marked and added to the mark stack.

(2) An evacuation pause determines H is no longer live, and reclaims
it.  This occurs before concurrent marking has gotten around to
processing the mark stack entry for H.

(3) Concurrent marking processes the mark stack entry for H,
attempting to scan the now dead object.

The approach being taken to solve this is to check each mark stack
entry as it is about to be scanned, to filter out and discard stale
entries for dead humongous objects.

The filter being used tests whether the entry appears to be for an
object that was allocated during the concurrent mark cycle, by
comparing the "object" against the associated region's
top-at-mark-start (TAMS) value.

Normal marking filters out such recent objects and doesn't mark them:
G1 allocates black (objects allocated during marking are implicitly
live), so there is no need to scan them.  As a result, there normally
aren't any such objects in the mark stack.
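
To make the allocate-black rule concrete, here is a minimal sketch of
the filtering done when marking an object.  It is a simplified model,
not the HotSpot code: Region, Marker and mark_and_push are
hypothetical names, addresses are plain integers, and the mark bitmap
is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of a heap region; not HotSpot's HeapRegion.
struct Region {
  std::size_t bottom;  // first address of the region
  std::size_t tams;    // top-at-mark-start, fixed when marking begins
  std::size_t top;     // current allocation pointer
};

// Hypothetical marker holding only a mark stack (mark bitmap omitted).
struct Marker {
  std::vector<std::size_t> mark_stack;  // addresses awaiting scanning

  // Push an object for scanning only if it lies below TAMS, i.e. it was
  // allocated before marking started.  Objects at or above TAMS were
  // allocated during marking and are treated as implicitly live
  // ("allocated black"), so they are not pushed.
  bool mark_and_push(const Region& r, std::size_t obj) {
    assert(r.bottom <= obj && obj < r.top);  // obj must lie in the region
    if (obj >= r.tams) {
      return false;  // recent object: nothing to mark or scan
    }
    mark_stack.push_back(obj);
    return true;
  }
};
```

In this model, every entry on the mark stack is below its region's
TAMS at the time it is pushed.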

When a humongous object is eagerly reclaimed, the associated start
region has its TAMS reset to the region bottom.  Even if the region is
later (during the same concurrent mark cycle) reallocated, its TAMS
value will remain fixed at region bottom.

As a result, a mark stack entry not below the containing region's TAMS
must be a stale entry for a reclaimed humongous object.
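
The filter described above can be sketched as follows.  This is an
illustrative model under stated assumptions, not the code in the
webrev: HRegion, eager_reclaim, is_stale_entry and drain are
hypothetical names, addresses are plain integers, and all mark stack
entries are assumed to belong to a single region.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified region model; not HotSpot's HeapRegion.
struct HRegion {
  std::size_t bottom;  // first address of the region
  std::size_t tams;    // top-at-mark-start
};

// Eager reclaim of a humongous object resets the start region's TAMS
// to the region bottom, as described in the text.
void eager_reclaim(HRegion& r) { r.tams = r.bottom; }

// A mark stack entry that is not below its region's TAMS is treated
// as a stale entry for a reclaimed humongous object.
bool is_stale_entry(const HRegion& r, std::size_t entry) {
  assert(entry >= r.bottom);  // entry must lie in the region
  return entry >= r.tams;
}

// Drain the mark stack, discarding stale entries.  Returns the entries
// that would actually be scanned (scanning itself is not modeled).
std::vector<std::size_t> drain(const HRegion& r,
                               std::vector<std::size_t>& stack) {
  std::vector<std::size_t> scanned;
  while (!stack.empty()) {
    std::size_t entry = stack.back();
    stack.pop_back();
    if (is_stale_entry(r, entry)) {
      continue;  // dead humongous object: skip without scanning
    }
    scanned.push_back(entry);
  }
  return scanned;
}
```

A humongous object starts at its region's bottom, so once
eager_reclaim has run in this model, the entry for H compares equal
to TAMS and is discarded by drain instead of being scanned.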

Note that automated regression testing for this problem is hard; even
a stress test with a high rate of humongous object allocation and
discard can take a long time to trip over this situation. Manual
stress testing with additional VM instrumentation has verified the
occurrence of the described scenario.

The additional test in concurrent marking imposes a small performance
degradation on concurrent marking.  Measurements of a program that
allocates a substantial number of objects and then does nothing but
repeatedly GC show a fraction of a percent increase in concurrent
mark time, which is well within the variance for even this contrived
test.  Aurora performance comparison showed no significant negative
impact.  Alternatives that preclean the mark stack when humongous
objects are reclaimed get complicated when attempting to do so without
extending the reclaiming evacuation pause.

CR:
https://bugs.openjdk.java.net/browse/JDK-8069367

Webrev:
http://cr.openjdk.java.net/~kbarrett/8069367/webrev.00/

Testing:
JPRT, Aurora G1 performance test, Aurora Ad-hoc GC Nightly, hand testing