RFR(S): 8009536: G1: Apache Lucene hang during reference processing
John Cuthbertson
john.cuthbertson at oracle.com
Tue Mar 12 22:00:27 UTC 2013
Hi Everyone,
Here's a new webrev based upon comments from Bengt and Thomas.
http://cr.openjdk.java.net/~johnc/8009536/webrev.1/
This webrev includes just the changes to resolve the hang caused by
overflowing the marking stack during *serial* reference processing. As I
said in my response to Bengt, this revision will produce the following
assertion failure:
> [junit4:junit4] 80.722: [GC remark 80.723: [GC ref-proc80.785: [GC
> concurrent-mark-reset-for-overflow]
> [junit4:junit4] # To suppress the following error report, specify this
> argument
> [junit4:junit4] # after -XX: or in .hotspotrc:
> SuppressErrorAt=/concurrentMark.cpp:809
> [junit4:junit4] #
> [junit4:junit4] # A fatal error has been detected by the Java Runtime
> Environment:
> [junit4:junit4] #
> [junit4:junit4] # Internal Error
> (/export/workspaces/8009536_3/src/share/vm/gc_implementation/g1/concurrentMark.cpp:809),
> pid=16314, tid=14
> [junit4:junit4] # assert(_finger == _heap_end) failed: only way to
> get here
> [junit4:junit4] #
> [junit4:junit4] # JRE version: Java(TM) SE Runtime Environment
> (8.0-b79) (build 1.8.0-ea-fastdebug-b79)
> [junit4:junit4] # Java VM: Java HotSpot(TM) Server VM
> (25.0-b23-internal-jvmg mixed mode solaris-x86 )
> [junit4:junit4] # Core dump written. Default location:
> /export/bugs/8009536/lucene-5.0-2013-03-05_15-37-06/build/analysis/uima/test/J0/core
> or core.16314
> [junit4:junit4] #
> [junit4:junit4] # An error report file with more information is saved as:
> [junit4:junit4] #
> /export/bugs/8009536/lucene-5.0-2013-03-05_15-37-06/build/analysis/uima/test/J0/hs_err_pid16314.log
> [junit4:junit4] #
> [junit4:junit4] # If you would like to submit a bug report, please visit:
> [junit4:junit4] # http://bugreport.sun.com/bugreport/crash.jsp
> [junit4:junit4] #
> [junit4:junit4] Current thread is 14
> [junit4:junit4] Dumping core ...
when run with parallel reference processing enabled. A fix for that
assertion failure will be sent out shortly.
JohnC
On 3/11/2013 2:35 PM, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers review these changes? The webrev can
> be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.
>
> First of all - many thanks to Uwe Schindler for discovering and
> reporting the problem and providing very clear instructions on how to
> reproduce the issue. Many thanks also to Dawid Weiss for stepping in
> with a self-contained reproducer.
>
> I also wish to thank Bengt for his help. It was Bengt who gave me the
> magic proxy formula that allowed Ivy to satisfy and download all the
> dependencies for the test case. Bengt also diagnosed the problem and
> gave an initial fix (which the changes in the webrev are based upon).
>
> Summary:
> During the remark pause, the execution of the parallel RemarkTask set
> the number of worker threads in the ParallelTaskTerminator and the
> first and second barrier syncs. During serial reference processing,
> the marking stack overflowed, causing the single thread (the VM thread)
> to enter the overflow handling code in CMTask::do_marking_step(). This
> overflow code uses two WorkBarrierSyncs to synchronize the threads
> before resetting the marking state for restarting marking. The barrier
> syncs were waiting for the number of threads that participated in the
> RemarkTask but, since only the VM thread was executing, only a single
> thread entered the barrier - resulting in the barrier waiting
> indefinitely for the other (nonexistent) threads.
>
> A proposed solution was to call set_phase to reset the number of
> threads in the parallel task terminator and the barriers to the number
> of active threads for the reference processing. This solution ran into
> a similar hang while processing the JNI references with parallel
> reference processing enabled. (In parallel reference processing, the
> JNI references are processed serially by the calling thread).
> Resetting the phase to single-threaded before processing the JNI refs
> solved the second hang but resulted in an assertion failure: only a
> concurrentGC thread can enter a barrier sync and the calling thread
> was the VM thread.
>
> Furthermore, another problem was discovered. If the marking state is
> reset, a subsequent call to set_phase() will assert as the global
> finger has been set to the start of the heap. This was discovered by
> the marking stack overflowing during the RemarkTask and parallel
> reference processing calling set_phase() to reinitialize the number of
> workers in the parallel task terminator. It was also discovered when trying out
> another proposed solution: adding a start_gc closure to reference
> processing which would call set_phase() before each processing phase.
> As a result, the marking state is now reset by worker 0 only if an overflow
> occurs during the concurrent phase of marking; if an overflow occurs
> during remark, reference processing is skipped, and the marking state
> is reset by the VM thread. Resetting the marking state before
> reference processing was a benign error (objects would be marked but
> not pushed on to the stack as they were no longer below the finger;
> the objects would then be traced, in the normal fashion, when marking
> restarted) but it's better to be safe than sorry. The other part of the
> fix for this secondary problem is that the parallel reference
> processing task executor now calls the terminator's reset_for_reuse()
> routine instead of set_phase().
>
> The resulting solution for the hang is based upon the patch sent out
> by Bengt - namely we do not enter the sync barriers when
> CMTask::do_marking_step() is being called serially. The difference is
> that I added an extra parameter to CMTask::do_marking_step() instead
> of piggy-backing on the existing parameter list. Additionally, if this
> new parameter indicates serial operation, the current thread will skip
> offering termination. This allows the serial drain closure to enter
> the termination protocol and execute the guarantees contained therein.
>
> The other changes are for the secondary issues, described above, that
> were discovered while trying out other possible solutions.
>
> Testing:
> The lucene test case with serial reference processing (with and
> without verification); the lucene test case with parallel reference
> processing (with and without verification).
> GC test suite with a mark stack size of 1K and 4K, with both serial
> and parallel reference processing (with and without verification).
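
To make the hang and the fix described in the quoted summary a little more
concrete, here is a toy model in Python. It is not HotSpot code and does not
come from the webrev: ToyMarkingTask, run_serial, run_parallel and the
is_serial parameter name are all hypothetical, and a single barrier stands in
for the two barrier syncs. The timeout exists only so the demo terminates;
the real barrier has no timeout and would simply wait forever.

```python
import threading


class ToyMarkingTask:
    """Hypothetical stand-in for CMTask; models only the overflow path."""

    def __init__(self, overflow_barrier):
        # Barrier sized for the number of workers that ran the remark task.
        self.overflow_barrier = overflow_barrier

    def do_marking_step(self, overflowed, is_serial, timeout):
        """Return 'done', 'restarted', or 'hung' (barrier never released)."""
        if not overflowed:
            return "done"
        if not is_serial:
            # Parallel callers rendezvous before the marking state is reset.
            try:
                self.overflow_barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # Assumption for the demo: a timeout stands in for the hang.
                return "hung"
        # A serial caller falls through without touching the barrier.
        return "restarted"


def run_serial(n_remark_workers, is_serial):
    """One thread handles an overflow against a barrier sized for N workers."""
    task = ToyMarkingTask(threading.Barrier(n_remark_workers))
    return task.do_marking_step(overflowed=True, is_serial=is_serial, timeout=0.2)


def run_parallel(n_workers):
    """All N workers handle the overflow, so the barrier is released."""
    task = ToyMarkingTask(threading.Barrier(n_workers))
    results = []

    def worker():
        results.append(
            task.do_marking_step(overflowed=True, is_serial=False, timeout=5.0))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a barrier sized for four remark workers, run_serial(4, is_serial=False)
models the original hang (one thread arrives, three never do), while
run_serial(4, is_serial=True) models the approach of not entering the sync
barriers when the step is called serially. run_parallel(4) shows that the
barrier path still works when every worker takes part.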