RFR(S): 8009536: G1: Apache Lucene hang during reference processing
Bengt Rutisson
bengt.rutisson at oracle.com
Tue Mar 12 15:05:04 UTC 2013
Hi John,
Thanks for fixing this so quickly!
I have only had a quick look at the change. I'll make sure to look
closer tomorrow. However, I have two questions. If you have time to
answer them it would be good. If you don't have time I hope they become
clear when I look more in detail at the change tomorrow...
First, in the constructor for G1CMDrainMarkingStackClosure() we do:
2285 _do_stealing = !_is_serial;
2286 _do_termination = true;
Just from looking at this it seems like we could get away with only
having one instance variable instead of three. Is that the case or can
any of _do_stealing, _is_serial or _do_termination change during the
life time of a G1CMDrainMarkingStackClosure instance?
Second, as you describe in the text below you are actually fixing two
issues with this patch. Would it make sense to split the changes up into
two changesets?
Thanks,
Bengt
On 3/11/13 10:35 PM, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers review these changes? The webrev can
> be found at: http://cr.openjdk.java.net/~johnc/8009536/webrev.0/.
>
> First of all - many thanks to Uwe Schindler for discovering an
> reporting the problem and providing very clear instructions on how to
> reproduce the issue. Many thanks also Dawid Weiss for also stepping in
> with a self-contained reproducer.
>
> I also wish to thank Bengt for his help. It was Bengt who gave me the
> magic proxy formula that allowed Ivy to satisfy and download all the
> dependencies for the test case. Bengt also diagnosed the problem and
> gave an initial fix (which the changes in the webrev are based upon).
>
> Summary:
> During the remark pause, the execution of the parallel RemarkTask set
> the number of workers thread in the ParallelTaskTerminator and the
> first and second barrier syncs. During serial reference processing,
> the marking stack overflowed causing the single (VMThread) thread to
> enter the overflow handling code in CMTask::do_marking_step(). This
> overflow code using two WorkBarrierSyncs to synchronize the threads
> before resetting the marking state for restarting marking. The barrier
> syncs were waiting for the number of threads that participated in the
> RemarkTask) but, since only the VM thread was executing, only a single
> thread entered the barrier - resulting in the barrier indefinitely
> waiting for the other (non existent) threads.
>
> A proposed solution was to call set_phase to reset the number of
> threads in the parallel task terminator and the barriers to the number
> of active threads for the reference processing. This solution ran into
> a similar hang while processing the JNI references with parallel
> reference processing enabled. (In parallel reference processing, the
> JNI references are processed serially by the calling thread).
> Resetting the phase to single-threaded before processing the JNI refs
> solved the second hang but resulted in an assertion failure: only a
> concurrentGC thread can enter a barrier sync and the calling thread
> was the VM thread.
>
> Furthermore another problem was discovered. If the marking state is
> reset, a subsequent call to set_phase() will assert as the global
> finger has been set to start of the heap. This was a discovered by the
> marking stack overflowing during the RemarkTask and parallel reference
> processing calling set_phase() to reinitialize number of workers in
> the parallel task terminator. It was also discovered when trying out
> another proposed solution: adding a start_gc closure to reference
> processing which would call set_phase() before each processing phase.
> As a result the marking state is only reset by worker 0 if an overflow
> occurs during the concurrent phase of marking; if an overflow occurs
> during remark, reference processing is skipped, and the marking state
> is reset by the VM thread. Resetting the marking state before
> reference processing was a benign error (objects would be marked but
> not pushed on to the stack as they were no longer below the finger;
> the objects would then be traced, in the normal fashion, when marking
> restarted) but it's better to safe than sorry. The other part of the
> fix for this secondary problem is that the parallel reference
> processing task executor now calls the terminator's reset_for_reuse()
> routine instead of set_phase().
>
> The resulting solution for the hang is based upon the patch sent out
> by Bengt - namely we do not enter the sync barriers when
> CMTask::do_marking_step() is being called serially. The difference is
> that I added an extra parameter to CMTask::do_marking_step() instead
> of piggy-backing on the existing parameter list. Additionally, if this
> new parameter indicates serial operation, the current thread will skip
> offering termination. This allows the serial drain closure to enter
> the termination protocol and execute the guarantees contained therein.
>
> The other changes are for the secondary issues, described above, that
> were discovered while out trying other possible solutions.
>
> Testing:
> The lucene test case with serial reference processing (with and
> without verification); the lucene test case with parallel reference
> processing (with and without verification).
> GC test suite with a mark stack size of 1K and 4K, with both serial
> and parallel reference processing (with and without verification).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20130312/9549f90d/attachment.htm>
More information about the hotspot-gc-dev
mailing list