RFR(S): 8005032: G1: Cleanup serial reference processing closures in concurrent marking

Wed Jan 30 17:51:37 UTC 2013

On 1/30/2013 5:21 AM, Bengt Rutisson wrote:
>
> Hi Charlie,
>
> On 1/17/13 9:52 PM, Charlie Hunt wrote:
>> John / Bengt:
>>
>> I think I can offer a bit of info on Bengt's earlier question about 
>> ParallelProcRefEnabled being disabled by default.
>>
>> IIRC, there was one workload that showed a slight perf regression 
>> with +ParallelProcRefEnabled.  That workload that showed a regression 
>> may not be as relevant as it was back when the evaluation / decision 
>> was made to disable it by default?
>
> Thanks for providing some history for these flags!
>
>> You both have probably thought about this already?  My reaction is 
>> ... I think reasonable defaults would be to enable 
>> +ParallelProcRefEnabled for Parallel[Old], CMS and G1 when 
>> ParallelGCThreads is greater than 1, and disable 
>> -ParallelProcRefEnabled with -XX:+UseSerialGC.
>
> This sounds like a good enhancement. John, if you agree, could you 
> file a CR for it? Or would you like me to file it?

I think  the regression was with benchmarks that did not have much 
Reference processing
but still had to pay the build up/tear down cost for the parallel work.  
Andrew did the work and
I think there was lots of performance data that went into turning it off 
by default.

Jon

>
> Bengt
>
>>
>> hths,
>>
>> charlie ...
>>
>> On Jan 17, 2013, at 3:02 PM, John Cuthbertson wrote:
>>
>>> Hi Bengt,
>>>
>>> There's a new webrev at: 
>>> http://cr.openjdk.java.net/~johnc/8005032/webrev.1/
>>>
>>> It looks larger than the previous webrev but the most of the change was
>>> tweaking comments. The actual code changes are smaller.
>>>
>>> Testing was the same as before.
>>>
>>> On 1/15/2013 1:18 AM, Bengt Rutisson wrote:
>>>> I see. I didn't think about the difference betweeen ParallelGCThreads
>>>> and ParallelRefProcEnabled. BTW, not part of this change, but why do
>>>> we have ParallelRefProcEnabled? And why is it false by default?
>>>> Wouldn't it make more sense to have it just be dependent on
>>>> ParallelGCThreads?
>>> I don't know and the answer is probably lost in the dark depths of time
>>> - I can only speculate. For G1 we have a CR to turn
>>> ParallelRefProcEnabled on if the number of GC threads > 1. I'm not sure
>>> about the other collectors.
>>>
>>>>> Setting it once in weakRefsWork() will not be sufficient. We will run
>>>>> into an assertion failure in
>>>>> ParallelTaskTerminator::offer_termination().
>>>>>
>>>>> During the reference processing, the do_void() method of the
>>>>> complete_gc oop closure (in our case the complete gc oop closure is
>>>>> an instance of G1CMParDrainMarkingStackClosure) is called multiple
>>>>> times (in process_phase1, sometimes process_phase2, process_phase3,
>>>>> and process_phaseJNI)
>>>>>
>>>>> Setting the phase sets the number of active tasks (or threads) that
>>>>> the termination protocol in do_marking_step() will wait for. When an
>>>>> invocation of do_marking_step() offers termination, the number of
>>>>> tasks/threads in the terminator instance is decremented. So Setting
>>>>> the phase once will let the first execution of do_marking_step (with
>>>>> termination) from process_phase1() succeed, but subsequent calls to
>>>>> do_marking_step() will result in the assertion failure.
>>>>>
>>>>> We also can't unconditionally set it in the do_void() method or even
>>>>> the constructor of G1CMParDrainMarkingStackClosure. Separate
>>>>> instances of this closure are created by each of the worker threads
>>>>> in the MT-case.
>>>>>
>>>>> Note when processing is multi-threaded the complete_gc instance used
>>>>> is the one passed into the ProcessTask's work method (passed into
>>>>> process_discovered_references() using the task executor instance)
>>>>> which may not necessarily be the same complete gc instance as the one
>>>>> passed directly into process_discovered_references().
>>>> Thanks for this detailed explanation. It really helped!
>>>>
>>>> I understand the issue now, but I still think it is very confusing
>>>> that _cm->set_phase() is called from
>>>> G1CMRefProcTaskExecutor::execute() in the multithreaded case and from
>>>> G1CMParDrainMarkingStackClosure::do_void() in the single threaded 
>>>> case.
>>>>
>>>>> It might be possible to record whether processing is MT in the
>>>>> G1CMRefProcTaskExecutor class and always pass the executor instance
>>>>> into process_discovered_references. We could then set processing to
>>>>> MT so that the execute() methods in the executor instance are invoked
>>>>> but call the Proxy class' work method directly. Then we could
>>>>> override the set_single_threaded() routine (called just before
>>>>> process_phaseJNI) to set the phase.
>>>> I think this would be a better solution, but if I understand it
>>>> correctly it would mean that we would have to change all the
>>>> collectors to always pass a TaskExecutor. All of them currently pass
>>>> NULL in the non-MT case. I think it would be simpler if they always
>>>> passed a TaskExecutor but it is a pretty big change.
>>> I wasn't meaning to do that for the other collectors just G1's
>>> concurrent mark reference processor i.e. fool the ref processor into
>>> think it's MT so that the parallel task executor is used but only use
>>> the work gang if reference processing was _really_ MT.
>>>
>>> I decided not to do this as there is an easier way. For the non-MT case
>>> we do not need to enter the termination protocol in
>>> CMTask::do_marking_step(). When there's only one thread we don't 
>>> need to
>>> use the ParallelTaskTerminator to wait for other threads. And we
>>> certainly don't need stealing. Hence the solution is to only do the
>>> termination and stealing if the closure is instantiated for MT 
>>> reference
>>> processing. That removes the set_phase call().
>>>
>>>> Another possibility is to introduce some kind of prepare method to the
>>>> VoidClosure (or maybe in a specialized subclass for ref processing).
>>>> Then we could do something like:
>>>>
>>>>   complete_gc->prologue();
>>>>   if (mt_processing) {
>>>>     RefProcPhase2Task phase2(*this, refs_lists, !discovery_is_atomic()
>>>> /*marks_oops_alive*/);
>>>>     task_executor->execute(phase2);
>>>>   } else {
>>>>     for (uint i = 0; i < _max_num_q; i++) {
>>>>       process_phase2(refs_lists[i], is_alive, keep_alive, 
>>>> complete_gc);
>>>>     }
>>>>   }
>>>>
>>>> G1CMParDrainMarkingStackClosure::prologue() could do the call to
>>>> _cm->set_phase(). And G1CMRefProcTaskExecutor::execute() would not
>>>> have to do it.
>>> The above is a reasonable extension to the reference processing code. I
>>> no longer need this feature for this change but we should submit a CR
>>> for it. I'll do that.
>>>
>>>> BTW, not really part of your change, but above code is duplicated
>>>> three times in ReferenceProcessor::process_discovered_reflist(). Would
>>>> be nice to factor this out to a method.
>>> Completely agree. Again I'll submit a CR for it.
>>>
>>> Thanks,
>>>
>>> JohnC
>