RFR: 8069367: assert(_nextMarkBitMap->isMarked((HeapWord*) obj)) failed

Mon Mar 9 09:52:16 UTC 2015

Hi Kim,

  thanks for looking into this issue...

On Fri, 2015-03-06 at 13:10 -0500, Kim Barrett wrote:
> Please review this change to fix a problem in the interaction between
> G1 concurrent marking and eager reclaim of humongous objects.
> 
> I will need a sponsor for this change.

I can sponsor it.

> 
> The scenario we are dealing with is
> 
> (1) A humongous object H is marked and added to the mark stack.
> 
> (2) An evacuation pause determines H is no longer live, and reclaims
> it.  This occurs before concurrent marking has gotten around to
> processing the mark stack entry for H.
> 
> (3) Concurrent marking processes the mark stack entry for H,
> attempting to scan the now dead object.
> 
> The approach being taken to solve this is to check each mark stack
> entry as it is about to be scanned, to filter out and discard stale
> entries for dead humongous objects.
> 
> The filter being used tests whether the entry appears to be for an
> object that was allocated during the concurrent mark cycle, by
> comparing the "object" against the associated region's
> top-at-mark-start (TAMS) value.
> 
> Normal marking filters out such recent objects and doesn't mark them
> because G1 allocates black, so there is no need to scan such objects.
> As a result, there normally aren't any such objects in the mark stack.
> 
> When a humongous object is eagerly reclaimed, the associated start
> region has its TAMS reset to the region bottom.  Even if the region is
> later (during the same concurrent mark cycle) reallocated, its TAMS
> value will remain fixed at region bottom.
> 
> As a result, a mark stack entry not below the containing region's TAMS
> must be a stale entry for a reclaimed humongous object.
> 
> Note that automated regression testing for this problem is hard; even
> a stress test with a high rate of humongous object allocation and
> discard can take a long time to trip over this situation. Manual
> stress testing with additional VM instrumentation has verified the
> occurrence of the described scenario.
> 
> The additional test in concurrent marking imposes a small performance
> degradation on concurrent marking.  Measurements of a program which
> allocates a substantial number of objects and then does nothing but
> repeatedly GC shows a fraction of a percent increase in concurrent
> mark time, which is well within the variance for even this contrived
> test.  Aurora performance comparison showed no significant negative
> impact.  Alternatives that preclean the mark stack when humongous
> objects are reclaimed get complicated when attempting to do so without
> extending the reclaiming evacuation pause.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8069367
> 
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8069367/webrev.00/

- concurrentMark.inline.hpp: in is_stale_humongous_queue_entry(), in the
long comment, could the change add some text that mentions that "Since
the humongous object has already been reclaimed, we can only check that
its base address is region aligned.".

- please improve the asserts by adding the failing addresses in question
(and region index). This helps diagnosing issues.

- it's "humongous object" and not "humongous" too in the assert.

- concurrentMark.hpp: the _stale_humongous_queue_entries should be a
size_t. Sangheon has reviews for the patch that changes the other
related members from int's already. So maybe wait until it has been
pushed for posting the new webrev.

- concurrentMark.cpp: what do you think about refactoring the 

  is_stale_humongous_queue_entry(obj) {
    // ignore
  } else {
    // do some asserts
    scan_object();
  }

into an extra method? This would allow unification of the asserts too.
The "assert(...->is_marked(obj), ...)" is duplicated in
CMTask::do_marking_step(), scan_object() already does that. Also the
master_free_list assert could be moved into the scan_object() method I
think. I cannot see a case where this would not be the case.

Also I would personally prefer to switch the if- and else- parts of that
(invert the condition). When reading if-statements, I somewhat expect
the far common case of execution to not be put into the else-part. It's
your call.

Please add the test case used to reproduce that explicitly enables eager
reclamation. While failure without the change is not 100% reproducable,
it seems good enough.

Thanks,
  Thomas