RFR(S): 8009536: G1: Apache Lucene hang during reference processing
Thomas Schatzl
thomas.schatzl at oracle.com
Tue Mar 12 10:13:37 UTC 2013
Hi,
On Mon, 2013-03-11 at 14:35 -0700, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers review these changes? The webrev can
> be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.
>
- maybe in concurrentMark.cpp you can now replace
992: reset_marking_state(concurrent() /* clear_overflow */);
by
reset_marking_state(true /* clear_overflow */);
since the whole block is guarded by that condition. I'm not sure it is
better, but with that change the reader doesn't have to work out what
the value of concurrent() is at this point.
Keeping it as is would also be fine.
- I'm slightly confused by the use of the "serial"
G1CMKeepAliveAndDrainClosure and G1CMDrainMarkingStackClosure closures
(which don't synchronize with other threads) in weakRefsWork() at lines
2465 and 2466. Or at least by the comment above them.
So, this "is_serial" flag indicates to do_marking_step() that it
should not try to synchronize with the concurrent marking threads when
terminating, as there are none.
There are still the parallel work gang threads using these "is_serial"
closures for processing (via the parallel task executor passed to
process_discovered_references()); a rough sketch of that split is below.
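For context, a minimal sketch of the serial/MT split in weakRefsWork();
the constructor arguments and the executor setup here are assumptions
to illustrate the flow, not a copy of the webrev:

  // Sketch only; argument lists are assumed. The "serial" closures are
  // built with is_serial == true and don't synchronize with the
  // concurrent marking threads.
  G1CMKeepAliveAndDrainClosure g1_keep_alive(this, task, true /* is_serial */);
  G1CMDrainMarkingStackClosure g1_drain_mark_stack(this, task, true /* is_serial */);
  G1CMRefProcTaskExecutor par_task_executor(g1h, this, workers, active_workers);

  // With a NULL executor, ReferenceProcessor runs the closures serially
  // on the current (VM) thread; otherwise the work gang threads run
  // their own closure instances via the executor.
  AbstractRefProcTaskExecutor* executor =
    rp->processing_is_mt() ? &par_task_executor : NULL;
  rp->process_discovered_references(&g1_is_alive,
                                    &g1_keep_alive,
                                    &g1_drain_mark_stack,
                                    executor);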
Following the code, in the parallel case, if a thread spawned by
weakRefsWork() "fails" due to mark stack overflow, it does not try to
synchronize with the others, mainly because the work gang it is running
in (via WorkGang::run_task()) waits for all threads anyway.
(Couldn't this synchronization potentially take a long time, as e.g. one
thread may be able to completely finish its work using local buffers
while the others are waiting on it?)
Then, regardless of parallelism, two (or three) caller levels up in
ConcurrentMarkThread::run(), after the VM thread finishes the safepoint,
the current code notices that there has been an overflow during remark
and schedules another safepoint.
(Is there a reason to schedule another safepoint to recover from a
marking overflow? That means back-to-back safepoints. Is this cheaper
than just retrying within the current safepoint?)
What do you think of renaming that flag? Maybe something like
"synchronize_with_cm_threads"; a possible signature is sketched below.
Unfortunately I cannot come up with a better, more generic name, as
synchronizing with the marking threads is exactly what do_marking_step()
does, and that is what the change wants to avoid (well, the flag has the
inverted sense).
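Just to make the rename concrete, something like the following; the
parameter list here is an assumption for illustration only, not the
actual signature in the webrev:

  // Hypothetical: "synchronize_with_cm_threads" replaces "is_serial"
  // with the sense inverted, so the serial callers would pass false.
  void CMTask::do_marking_step(double time_target_ms,
                               bool do_termination,
                               bool synchronize_with_cm_threads);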
The problem seems to be that do_marking_step() as it is now is geared
too much towards use by the concurrent marking threads and relies on a
lot of their details, and the current code reuses do_marking_step()
even though it is not running in the marking threads.
So another option could be to extract the concurrent-marking-thread
specific parts (e.g. termination) of do_marking_step() into a helper
class/interface and pass a suitable implementation for each of its
uses. Instead of the many _cm-> references in do_marking_step(), it
would call the appropriate methods on that object; see the sketch after
this paragraph. But that would be quite some additional work... (and I
didn't think it through, it's just a thought)
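Something along these lines, purely as a sketch: all names below are
made up for illustration and this is not the actual HotSpot code.

  // Hypothetical interface that hides the thread-coordination details
  // from do_marking_step().
  class CMTerminationProtocol {
  public:
    // Returns true when the calling task is allowed to finish.
    virtual bool offer_termination(CMTask* task) = 0;
    // Overflow-handling barriers; no-ops when running standalone.
    virtual void enter_first_sync_barrier(uint worker_id) = 0;
    virtual void enter_second_sync_barrier(uint worker_id) = 0;
  };

  // Used by the concurrent/parallel marking threads; delegates to the
  // shared ConcurrentMark state (_cm->terminator() and friends).
  class CMThreadsTermination : public CMTerminationProtocol { /* ... */ };

  // Used by the serial reference processing path: there are no other
  // marking threads, so termination always succeeds and the barriers
  // do nothing.
  class CMSerialTermination : public CMTerminationProtocol {
  public:
    virtual bool offer_termination(CMTask*) { return true; }
    virtual void enter_first_sync_barrier(uint)  { }
    virtual void enter_second_sync_barrier(uint) { }
  };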
Anyway, could the comment above these lines be changed? The current
comment indicates (to me) that the closures are only run by the VM
thread. E.g. instead of
E.g. instead of
2463:// Serial (i.e executed by VMThread) instances of the 'Keep Alive'
2464:// and 'Complete GC' closures.
use something like:
// These closures do not need to synchronize with the concurrent
// marking threads, as this code is either run serially by this thread
// (i.e. processing_is_mt is false, so the VM thread does the work), or
// by the worker gang tasks, which have their own synchronization.
- One other minor thing in concurrentMark.cpp:
4317: bool finished = (is_serial ? true
                                 : _cm->terminator()->offer_termination(this));
Maybe change this to
bool finished = (is_serial ||
                 _cm->terminator()->offer_termination(this));
While this is functionally no different, I actually had to pause and
think about the intent of the "?" operator here :) It seems easier to
read and more common not to use the "?" operator in such cases.
Thomas