RFR(S): G1: assert(_finger == _heap_end) failed, concurrentMark.cpp:809

John Cuthbertson john.cuthbertson at oracle.com
Wed Mar 13 17:21:07 UTC 2013


Hi Bengt,

Thanks for looking at the code. I'm happy to answer questions.

On 3/13/2013 6:05 AM, Bengt Rutisson wrote:
>
> Hi John,
>
> Thanks for splitting this up into two webrevs! :)
>
> I understand that we only have to reset the marking phase in 
> enter_first_sync_barrier() if we are during concurrent mark since we 
> are anyway doing it in checkpointRootsFinal(). And I think you are 
> correct about the fact that we don't have to drain the discovered 
> references if we have hit an overflow.
>
> So, basically I think I understand the changes for 1. and 3. that you 
> explained below.
>
> I'm struggling a bit to grasp the change motivated by 2. Are these the 
> calls to set_phase() that triggered the assert before? Do we still 
> need to remove them considering that we no longer reset the marking 
> phase during remark?

Yes - these are the calls that triggered the asserts. Here's the stack 
trace:

Stack: [0xd817f000,0xd81ff000], sp=0xd81fe34c, free space=508k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
C=native code)
V [libjvm.so+0xdcaeac] void VMError::report(outputStream*)+0xdec
V [libjvm.so+0xdcc205] void VMError::report_and_die()+0x735
V [libjvm.so+0x6af46b] void report_vm_error(const char*,int,const 
char*,const char*)+0x8b
V [libjvm.so+0x65e6a3] void ConcurrentMark::set_phase(unsigned,bool)+0x1a3
V [libjvm.so+0x660534] void 
G1CMRefProcTaskExecutor::execute(AbstractRefProcTaskExecutor::ProcessTask&)+0xc4
V [libjvm.so+0xc5275c] void 
ReferenceProcessor::process_discovered_reflist(DiscoveredList*,ReferencePolicy*,bool,BoolObjectClosure*,OopClosure*,VoidClosure*,AbstractRefProcTaskExecutor*)+0x25c
V [libjvm.so+0xc50af6] void 
ReferenceProcessor::process_discovered_references(BoolObjectClosure*,OopClosure*,VoidClosure*,AbstractRefProcTaskExecutor*)+0x266
V [libjvm.so+0x660975] void ConcurrentMark::weakRefsWork(bool)+0x2d5
V [libjvm.so+0x65f593] void 
ConcurrentMark::checkpointRootsFinal(bool)+0x153
V [libjvm.so+0x6983e8] void CMCheckpointRootsFinalClosure::do_void()+0x18
V [libjvm.so+0xe0ecd1] void VM_CGC_Operation::doit()+0xe1
V [libjvm.so+0xe0b82d] void VM_Operation::evaluate()+0x7d
V [libjvm.so+0xe0938c] void 
VMThread::evaluate_operation(VM_Operation*)+0x9c
V [libjvm.so+0xe09960] void VMThread::loop()+0x4d0
V [libjvm.so+0xe08fc1] void VMThread::run()+0x111
V [libjvm.so+0xbb1490] java_start+0x210
C [libc.so.1+0xa73b7] _thr_setup+0x4e
C [libc.so.1+0xa76a0] __moddi3+0x60

If an overflow occurred while executing one ref processing task, the 
call to set_phase() for the next task would assert.
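
To make the failure mode concrete, here is a toy sketch (illustrative 
only - the names are made up and this is not the HotSpot code) of the 
pattern: reference processing runs the executor once per reference 
type, each run used to re-initialize the marking phase, and that 
re-initialization asserts that the previous round left the marking 
state clean. Running it trips the assert on the second task, which is 
the shape of the reported failure:

#include <cassert>

// Toy model of the old per-task set_phase() pattern; not HotSpot code.
struct ToyMarking {
  bool finger_at_end = true;     // stands in for _finger == _heap_end

  void set_phase() {
    // Loosely corresponds to the assert at concurrentMark.cpp:809.
    assert(finger_at_end && "previous ref processing task overflowed");
  }

  void run_ref_proc_task(bool overflows) {
    set_phase();                 // re-initialized for every task
    if (overflows) {
      finger_at_end = false;     // an overflow leaves the state dirty
    }
  }
};

int main() {
  ToyMarking cm;
  cm.run_ref_proc_task(true);    // first task hits an overflow
  cm.run_ref_proc_task(false);   // set_phase() for the next task asserts
  return 0;
}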

>
> Also, you replaced the calls to set_phase() with calls to 
> _cm->terminator()->reset_for_reuse(_active_workers). The comment says 
> that this is necessary for the termination protocol in 
> do_marking_step() to work properly. But that method also uses 
> _first_overflow_barrier_sync and _second_overflow_barrier_sync. Don't 
> we have to reset those too? Similarly to what set_phase() does?

AFAICT we don't need to reset the first and second barrier syncs.

ConcurrentMark::set_phase(active_workers, ...) initializes _terminator 
with a new ParallelTaskTerminator instance for the given number of 
workers, which zeroes the _offered_termination field. When a worker 
offers termination, _offered_termination is incremented. The only ways 
to reset _offered_termination are to create a new instance (as 
set_phase() does) or to call reset_for_reuse():

> void ParallelTaskTerminator::reset_for_reuse() {
>   if (_offered_termination != 0) {
>     assert(_offered_termination == _n_threads,
>            "Terminator may still be in use");
>     _offered_termination = 0;
>   }
> }

Calling reset_for_reuse() reuses the existing terminator, zeroing 
_offered_termination (the overload that takes a thread count also 
updates _n_threads). Note that reset_for_reuse() is not called 
automatically when the number of threads offering termination reaches 
the number of threads to wait for; we have to call it explicitly.
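
As a self-contained illustration of that property (again toy code, not 
the HotSpot ParallelTaskTerminator): once every worker has offered 
termination the count sits at _n_threads, and it stays there until 
someone explicitly resets it before the next use:

#include <cassert>

// Toy terminator showing only the reuse behaviour discussed above.
class ToyTerminator {
  int _n_threads;
  int _offered_termination;
public:
  explicit ToyTerminator(int n) : _n_threads(n), _offered_termination(0) {}

  // Called by a worker when it has run out of work.
  void offer_termination()    { ++_offered_termination; }
  bool all_terminated() const { return _offered_termination == _n_threads; }

  // Must be called explicitly before the terminator is used again.
  void reset_for_reuse() {
    if (_offered_termination != 0) {
      assert(_offered_termination == _n_threads && "still in use");
      _offered_termination = 0;
    }
  }
};

int main() {
  ToyTerminator t(2);
  t.offer_termination();
  t.offer_termination();         // first ref processing task completes
  assert(t.all_terminated());
  t.reset_for_reuse();           // without this, the next task would start
  assert(!t.all_terminated());   // with a terminator that is already "done"
  return 0;
}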

Compare this with WorkGangBarrierSync. set_phase() calls set_n_workers() 
on the first and second barrier syncs. This routine sets _n_workers to 
the given number of threads and clears _n_completed and _should_reset:

> void WorkGangBarrierSync::set_n_workers(uint n_workers) {
>   _n_workers   = n_workers;
>   _n_completed = 0;
>   _should_reset = false;
> }

When the last thread enters the barrier sync, _should_reset is set to 
true. When do_marking_step() is called for the next ref processing 
task, the first worker to enter the barrier resets _n_completed:

> void WorkGangBarrierSync::enter() {
>   MutexLockerEx x(monitor(), Mutex::_no_safepoint_check_flag);
>   if (should_reset()) {
>     // The should_reset() was set and we are the first worker to enter
>     // the sync barrier. We will zero the n_completed() count which
>     // effectively resets the barrier.
>     zero_completed();
>     set_should_reset(false);
>   }

Hence we don't need to explicitly reset the barrier syncs. Make sense?

After answering your question I realize I should be calling the version 
of reset_for_reuse() that takes no parameters. We don't want to change 
the value of _n_threads in the terminator, because it could then end up 
different from the value in the barrier syncs.
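
For completeness, and from memory (so treat this as a sketch rather 
than a verbatim quote of the sources), the overload that takes a 
thread count just does the same zeroing and then overwrites _n_threads, 
which is exactly the field we want to keep in step with the barrier 
syncs:

void ParallelTaskTerminator::reset_for_reuse(int n_threads) {
  reset_for_reuse();
  _n_threads = n_threads;
}

So the no-argument version leaves _n_threads at whatever set_phase() 
established at the start of remark.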

> Finally, why does it say "// Not strictly necessary but..." in 
> G1CMRefProcTaskExecutor::execute() ?

It's not necessary because the enqueue task doesn't interact with the 
parallel task terminator (i.e., it doesn't call do_marking_step); it 
just turns around and calls RefProcEnqueueTask::work(). Doing it anyway 
just increases the similarity and symmetry of the two execute() 
routines and adds a little bit of "just in case".

Cheers,

JohnC


