RFR(S): 8009536: G1: Apache Lucene hang during reference processing

John Cuthbertson john.cuthbertson at oracle.com
Wed Mar 13 21:34:56 UTC 2013


Hi Bengt,

Thanks. Enjoy the long weekend!

JohnC

On 3/13/2013 2:01 PM, Bengt Rutisson wrote:
>
> Hi John,
>
> Thanks for explaining a bit about the serial case where we still need 
> to go through termination protocol.
>
> As you know I'll be on vacation the next two days, so I just wanted to 
> let you know that with the changes that you outline below (the fix to 
> CMRemarkTask and removing the do_stealing argument) I'm fine with this 
> change. Feel free to push this if you get another review.
>
> Thanks,
> Bengt
>
> On 3/13/13 6:57 PM, John Cuthbertson wrote:
>> Hi Bengt,
>>
>> Thanks for looking at the change again. Replies inline....
>>
>> On 3/13/2013 3:22 AM, Bengt Rutisson wrote:
>>>
>>>
>>> Hi John,
>>>
>>> I'm not too familiar with the do_marking_step() code, so I'm trying 
>>> to get to know it while I'm looking at your webrev. So, maybe I'm 
>>> completely off here, but I think there are a couple of strange things.
>>>
>>> After your change do_marking_step() takes three bools: do_stealing, 
>>> do_termination and is_serial.
>>>
>>> This means that we have 8 possible states if we combine all value of 
>>> these bools. I went through all calls to do_marking_step() and it 
>>> seems like we make use of 4 of those states:
>>>
>>>
>>> G1CMKeepAliveAndDrainClosure::do_oop_work()
>>>   do_marking_step(false, false, true)             (1)
>>>
>>> CMConcurrentMarkingTask::work()
>>>   do_marking_step(true, true, false)              (2)
>>>
>>> CMRemarkTask::work()
>>>   do_marking_step(true, true, false)              (2)
>>>   do_marking_step(true, true, true)               (3)
>>>
>>> G1CMDrainMarkingStackClosure::do_void()
>>>   do_marking_step(false, true, true)              (4)
>>>   do_marking_step(true, true, false)              (2)
>>>
>>
>> Yeah. Well...
>>
>>>
>>> I numbered the states (1) - (4). Here are all states and I marked 
>>> which ones we use:
>>>
>>>  stealing,  termination,  serial    (3)
>>>  stealing, !termination,  serial
>>>  stealing, !termination, !serial
>>>  stealing,  termination, !serial    (2) parallel
>>> !stealing,  termination,  serial    (4)
>>> !stealing, !termination,  serial    (1) serial
>>> !stealing, !termination, !serial
>>> !stealing,  termination, !serial
>>>
>>>
>>> I think state (2) makes sense as the parallel case. We do stealing 
>>> and termination and we are not serial. State (1) also makes sense. 
>>> It is clearly the serial case. No stealing or termination. Just serial.
>>>
>>> Now, case (3) looks odd to me. We say we are serial but we want to 
>>> do stealing and termination. This case comes from CMRemarkTask. It 
>>> takes a is_serial argument in its constructor. We get to case (3) 
>>> from ConcurrentMark::checkpointRootsFinalWork(), where we do:
>>>
>>>   if (G1CollectedHeap::use_parallel_gc_threads()) {
>>>     ...
>>>   } else {
>>>     ...
>>>     CMRemarkTask remarkTask(this, active_workers, true /* is_serial*/);
>>>   }
>>>
>>> To me this looks like it should lead to the pure serial case (1) 
>>> instead of (3). Do you agree? In that case I think 
>>> CMRemarkTask::work() needs to be changed to do:
>>>
>>>         task->do_marking_step(1000000000.0 /* something very large */,
>>>                               !_is_serial /* do_stealing */,
>>>                               !_is_serial /* do_termination */,
>>>                               _is_serial);
>>>
>>
>> You're right here. The serial remark case should not enable stealing. 
>> It doesn't make sense. Done!
>>
>>>
>>> The other case that sticks out is (4). We don't want to do stealing, 
>>> but we want to do termination even though we are serial. This comes 
>>> from how we set up the g1_drain_mark_stack in 
>>> ConcurrentMark::weakRefsWork(). We pass that along to the reference 
>>> processing, which I think is serial (wasn't that the original 
>>> problem? :) ). We also pass the g1_keep_alive which is set up to be 
>>> in state (1). So, I wonder if it is correct that the 
>>> g1_drain_mark_stack is in state (4). If you agree that we should be 
>>> using the pure serial case here too, I think we need to change 
>>> G1CMDrainMarkingStackClosure::do_void() to be:
>>>
>>>       _task->do_marking_step(1000000000.0 /* something very large */,
>>>                              !_is_serial /* do_stealing    */,
>>>                              !_is_serial /* do_termination */,
>>>                              _is_serial  /* is_serial */);
>>>
>>>
>>
>> That is what we had before and I wasn't happy with it. The serial 
>> drain closure is the last thing that get's executed as part of 
>> reference processing - when process_phaseJNI invokes the 
>> complete_gc.do_void() method. In the drain closure either serial or 
>> parallel I think we want to go through the termination protocol. We 
>> just don't want to interact with the terminator in the serial case. 
>> There's a bunch of guarantees and checks in the termination protocol 
>> and we should still go through them - even when serial.
>>
>>>
>>> Again, I'm not sure about this, but if the above is correct we will 
>>> be left with only two states. The parallel and the serial. In that 
>>> case I think it would make sense to remove the do_stealing and 
>>> do_termination arguments for do_marking_step() and just pass in the 
>>> is_serial argument. Internally do_marking_step() could set up two 
>>> local variable to track stealing and termination. That might be good 
>>> if we see the need in the future to split these up again.
>>
>> I think we can get rid of the do_stealing parameter - it's always the 
>> negation of the is_serial parameter and I like the idea of making it 
>> a local. How about that? For example
>>
>> void CMTask::do_marking_step(double ms, bool do_termination, bool 
>> is_serial) {
>>   // It does not make sense to allow stealing when called serially.
>>   bool do_stealing = !is_serial;
>>
>> Thanks,
>>
>> JohnC
>




More information about the hotspot-gc-dev mailing list