RFR(M) 8043575: Dynamically parallelize reference processing work

Thu Jun 7 14:12:09 UTC 2018

Hi Sangheon,

On Wed, 2018-06-06 at 23:38 -0700, sangheon.kim at oracle.com wrote:
> Hi Kim,
> 
> Thanks for the review and sorry for replying late.
> I had to spend some time to figure out complete_gc closure stuff and
> re-running all tests again.

I took these .2 changes and started merging it with the JDK-8202845
changes.

As I suspected, the change gets a lot smaller and nicer with that. More
about that below.

> 
> On 6/4/18 5:11 PM, Kim Barrett wrote:
> > > On Jun 1, 2018, at 5:48 PM, sangheon.kim at oracle.com wrote:
> > > 
> > > Hi all,
> > > 
> > > As webrev.0 is conflicting with webrev.0 of "8203319: JDK-8201487 
> > > disabled too much queue balancing"(out for review, but not yet
> > > pushed), I'm posting webrev.1.
> > > 
> > > http://cr.openjdk.java.net/~sangheki/8043575/webrev.1
> > > http://cr.openjdk.java.net/~sangheki/8043575/webrev.1_to_0
> > 
> > The hookup of the changes into the various collectors seems
> > okay.  My comments are mostly focused on the ReferenceProcessor
> > changes.
> > 

CMS is okay with latest webrev, G1 looks it too. However there are
issues with parallel. From what I understand for all workers that are
"idle", you need to put "idle" tasks in the queue.
Parallel always seems to start all non-idle ("working" on an IdleTask)
worker threads all the time.
So in case with this change you want select a lower number of threads,
you first need to start all threads with some amount of IdleTask in the
queue (there are some helper methods in GCTaskManager). If you want a
higher number of threads, you need to release all idle workers, and
then prepopulate the task queue with a new amount of IdleTasks and the
actual worker tasks you want.

Overall I think we should defer this feature for parallel gc too.

> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessorPhaseTimes.hpp
> >  140   void set_phase_number(RefProcPhaseNumbers phase_number) {
> > _phase_number = phase_number; }
> >  141   RefProcPhaseNumbers phase_number() const { return
> > _phase_number; }
> > 
> > I think I dislike these and how they are used.  And see other
> > discussion below, suggesting they might not be needed.
> > 
> > (Though they are similar to other related states. But maybe I
> > dislike those too.)
>  I can avoid using it as you suggested below.
>
> But it is still needed at other locations in
> referenceProcessorPhaseTimes.cpp. If you are saying we need to avoid
> using this stuffs, I think it is out of scope for this patch. So,
> please file a CR.
> 

I think most of these issues are kind of resolved with the JDK-8202845
changes; it contains a lot of refactoring of that code.

> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.hpp
> >  641 class RefProcMTDegreeAdjuster : public StackObj {
> > ...
> >  649   RefProcMTDegreeAdjuster(ReferenceProcessor* rp,
> > ReferenceProcessorPhaseTimes* times, size_t ref_count);
> > 
> > The times argument is here to provide information about what phase
> > we're in.  That seems really indirect, and I'd really prefer that
> > information to be provided directly as arguments here.
>  Webrev.2 uses ReferenceType and RefProcPhaseNumbers directly, please
> have a chance to compare with previous version.
> 
> > I think that might also remove the need for the new phase_number
> > and set_phase_number functions for the phase times class.
>  As I noted above, those are still used in other locations.
> 
> > This also would flow through to ergo_proc_thread_count and
> > use_max_threads.
>  Done.
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.cpp
> >  727   bool must_balance = mt_processing &&
> > need_balance_queues(refs_lists);
> > 
> > need_balance_queues returns the right answer for an initial
> > balancing, but not for a re-balance after some
> > processing.  Specifically, further reducing the mt-degree may
> > require rebalancing, even if it wasn't previously needed.
> > 
> > (This comment is assuming the updated version of
> > need_balance_queues that I sent out today.  The situation is worse
> > without that.)
>  Done.
> 
> > -----------------------------------------------------------------
> > -------------
> > src/hotspot/share/gc/shared/referenceProcessor.cpp 
> >  787   // For discovery_is_atomic() is true, phase 3 should be
> > processed to call do_void() from VoidClosure.
> >  788   // For discovery_is_atomic() is false, phase 3 can be
> > skipped if there are no references because do_void() is
> >  789   // already called at phase 2.
> >  790   if (!discovery_is_atomic()) {
> >  791     if (total_count(refs_lists) == 0) {
> >  792       return;
> >  793     }
> >  794   }
> > 
> > The discovery_is_atomic stuff is left-over from before JDK-8203028.
> > Phase2 now never calls the complete_gc closure, because it is never
> > needed, regardless of the value of discovery_is_atomic.  So
> > checking that can be removed here.
>  Here are 2 things.
> 1) If we call complete_gc closure before exit, is it okay?
>     - The short answer is no, as we face an assertion failure at CMS
> ParNew. The long answer is: For ParallelRefProcEnabled=true case, CMS
> ParNew is relying on to do some extra work by AbstractExecutor. So 

:)

> eventually ParScanThreadState::_young_old_boundary is set and then
> ParEvacuateFollowersClosure is used as a complete_gc closure. Just
> calling complete_gc closure that passed as a parameter doesn't help
> as it is EvacuateFollowersClosureGeneral<ScanClosure,
> ScanClosureWithParBarrier>. i.e. complete_gc closure is different at
> process_discovered_references() vs. phase3, ProcessTask::work().

Let's just look at the JDK-8202845 changes which seem to make this
issue non-existent. Also there is no CMS support, so there should be no
changes.

> 2) Why 'ref_count' is not updated and then line 800 is just using old
> 'ref_count'?
>     - If we use different number of workers(not exit at zero ref
> cases), we face a few types of assertion failures. IIRC, faced
> SIGSEGV too at copy_to_survivor_space().
>     - Each collector has own per worker state information, so I think
> it will not properly updated if we use different number of workers
> between phase2 and phase3. 
>     - This could be applied between phase1 and phase2, but I couldn't
> see any problem so far. So I left as is the exit path between phase1
> and phase2. An alternative would be excluding exit among phases and
> address it from the CR. Previously there were another CR of JDK-
> 8181214.

... and this one too.

> * My previous version allowed only !discovery_is_atomic() cases would
> exit. With that condition, most cases are filtered out so CMS old gc
> and G1 CM would get the benefit. I never faced any crashes or
> assertion failures with it. But for the reason of 2) I'm removing
> exit path between phase2 and phase3 with some explanation why we
> can't exit. Please give me a better wording. 

... and that one.

> > At least, that's what's expected by the ReferenceProcessing
> > code.  The atomic discovery case for phase2 has "always" (since at
> > least the beginning of the mercurial age) ignored the complete_gc
> > argument, and only used it when dealing with the concurrent
> > discovery case.
> > 
> > But G1CopyingKeepAliveClosure appears to have always passed the
> > buck to the complete_gc closure, contrary to the expectations of
> > the RP phase2 code.  So G1 young/mixed collections only work
> > because phase3 will eventually call the complete_gc closure (by
> > each of the threads).
> > (G1 concurrent and full gc's have a cuttoff where no work is left
> > for the complete_gc closure if the object is already marked. I spot
> > checked other collectors, and at a quick glance it looks like the
> > others meet the expectations of the ReferenceProcessor code; it's
> > just G1 young/mixed collections that don't.)
> > 
> > I'm not entirely sure what happens if the number of threads in
> > phase3 is less than the number in phase2 in that case.  I think any
> > pending phase2 work associated with the missing threads will get
> > stolen by phase3 working threads, but haven't verified that.
> > 
> > I think the simplest fix for this is to change phase2 to always
> > call the complete_gc closure after all.  For some collections
> > that's a (perhaps somewhat expensive) nop, but in the overall
> > context of reference processing that's probably in the noise.  And
> > also update the relevant comme
> > The alternative would be to make G1CopyingKeepAliveClosure meet the
> > (not explicitly stated in the API) ReferenceProcessor expectations,
> > perhaps bypassing some of the G1ParScanThreadState machinery.
>  I think this is already covered above.
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.cpp 
> >  800     RefProcMTDegreeAdjuster a(this, phase_times, ref_count);
> > 
> > This is using ref_count, which was last set on line 759, and not
> > updated after phase2.  It should have been updated as part of the
> > new
> > code block around line 790.
>  I think this is already answered above.
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.cpp
> > 1193 uint
> > ReferenceProcessor::ergo_proc_thread_count(ReferenceProcessorPhaseT
> > imes* times) const {
> > 1194   size_t ref_count = total_count(_discovered_refs);
> > 1195 
> > 1196   return ergo_proc_thread_count(ref_count, num_queues(),
> > times);
> > 1197 }
> > 
> > I don't think this overload should exist.  The ref_count used here
> > is the SoftReference count, which isn't particularly interesting
> > here. (It's not the total number of discovered references, which is
> > also not an interesting number for this purpose.)
> > 
> > It seems to only exist as part of the workarounds for making ParNew
> > kind of limp along with these changes.  But wouldn't it be simpler
> > to leave ParNewRefProceTaskExecutor::execute alone, using the
> > active_workers as before?  (And recall that I suggested above that
> > execute should just use the currently configured mt-processing
> > degree, rather than adding an ergo_workers argument.) It will be
> > called in the context of RefProcMTDegreeAdjuster, that won't do
> > anything because of ParNew disables mt-degree adjustment.  So I
> > think by leaving things alone here we retain the current behavior,
> > e.g. CMS ParNew uses ParallelRefProcEnabled as an all-or-nothing
> > flag, and doesn't pay attention to the new
> > ReferencesPerThread.  And that seems reasonable to me for a
> > deprecated collector.
>  Done.
> I agree for not supporting CMS so we don't need to add the overload
> version for CMS.
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/cms/parNewGeneration.cpp
> > 1455                              false);                     //
> > disable adjusting queue size when processing references
> > 
> > It's not the queue size that adjusted (or not), it's the number of
> > queues.  And really, it's the MT processing degree.
> > 
> > That last probably should also apply to the name of the variable in
> > the reference processor, and related names.  Although given that
> > this is all a bit of a kludge to work around (deprecated) CMS
> > deficiencies, maybe I shouldn't be too concerned.  And you did file
> > JDK-8203951.
> > 
> > Though I see Thomas also requested a name change...
> > 
> > If you change the name, remember to update JDK-8203951.
>  Sure, I will update the name at JDK-8203951.
> webrev.2 is proposing "bool _adjust_no_of_processing_threads".
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.cpp
> > 1199 uint ReferenceProcessor::ergo_proc_thread_count(size_t
> > ref_count,
> > 1200                                                 uint
> > max_threads,
> > 1201                                                 ReferenceProce
> > ssorPhaseTimes* times) const {
> > ...
> > 1204   if (ReferencesPerThread == 0) {
> > 1205     return _num_queues;
> > 1206   }
> > 
> > Why is _num_queues used here, but max_threads used elsewhere in
> > this function.  I *think* this ought to also be max_threads.
>  Agreed, so changed to return max_threads.
> Basically _num_queues and max_threads are same value here but for
> consistency, max_threads is better. :)
> 
> As you may know, the intent of exiting here is to avoid SIGFPE at   
> size_t thread_count = 1 + (ref_count / ReferencesPerThread);
> 
> Basically this method will not be called if ReferencesPerThread == 0
> because it will be filtered out at line 1235. i.e. ctor of
> RefProcMTDegreeAdjuster.
> But I wanted to have something to avoid SIGFPE within this method
> too.
> too.
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.cpp  
> > 1229
> > RefProcMTDegreeAdjuster::RefProcMTDegreeAdjuster(ReferenceProcessor
> > * rp,
> > ...
> > 1235   if (!_rp->has_adjustable_queue() || (ReferencesPerThread ==
> > 0)) {
> > 
> > This checks ReferencesPerThread for 0 to do nothing.  And then
> > ergo_proc_thread_count similarly checks for 0 to (effectively) do
> > nothing. If the earlier suggestion to kill the 1-arg overload is
> > taken, then ergo_proc_thread_count is really just a helper for
> > RefProcMTDegreeAdjuster, and we don't need that special case twice.
>  The check for ReferencesPerThread is to disable this feature.
> 
> > And perhaps it should be a private helper in
> > RefProcMTDegreeAdjuster?
> > Similarly for use_max_threads.  Or maybe just inline them into that
> > class's constructor.  Good idea and it was one of my original patch
> > before considering to support CMS.
> Done.
> 
> > -----------------------------------------------------------------
> > ------------- 
> > src/hotspot/share/gc/shared/referenceProcessor.cpp 
> > 1199 uint ReferenceProcessor::ergo_proc_thread_count(size_t
> > ref_count,
> > ...
> > 1213   return (uint)MIN3(thread_count,
> > 1214                     static_cast<size_t>(max_threads),
> > 1215                     (size_t)os::initial_active_processor_count
> > ());
> > 
> > I don't see why this should look at initial_active_processor_count?
> > 
> > -----------------------------------------------------------------
> > -------------
>  As I noted in pre-review email thread, it showed better results if I
> avoid using more than active cpus.
> This is a kind of enhancement for better performance, so even though
> the command-line option was set to use more than cpus(via
> ParalelGCThreads), I think it is good to limit. This is different
> from command-line option processing which lets users decide even
> though the given options would results in worse performance.
> 
> Or if you are saying I should use "active_processor_count()"? If it
> is the case, I updated to use 'active_processor_count()' instead of
> 'initial_active_processor_count()'.
> 
> I would prefer to have this unless there's any reason of not having
> it.

First, always use active_processor_count() for current thread
decisions. Initial_active_processor_count() is only here to get a
consistent value  of the processor count during initialization.

What has been your test to determine this? GCOld?

I think this may be outdated as with recent changes scalability (with
G1) should have improved a lot.

> ----
> In addition:
> 1) I removed condition check of (num_queues() >1) because it would
> affect to a caller side expectation for per worker state information
> etc.. I never faced any problem though.
> 2) Minor changes such as renaming. e.g.
> RefProcMTDegreeAdjuster::_saved_num_queue to _saved_num_queues.
> 
> Webrev:
> http://cr.openjdk.java.net/~sangheki/8043575/webrev.2
> http://cr.openjdk.java.net/~sangheki/8043575/webrev.2_to_1
> Testing: hs-tier1~5 with/without ParallelRefProcEnabled

There is a WIP webrev at

http://cr.openjdk.java.net/~tschatzl/8043575/webrev.3/

for the curious (I think accomodated the review comments so far too).
Still needs at least some more testing.

And I need to remove the parallel gc support.

I am also not sure whether there has been a decision on how this
feature should be enabled, i.e. which switches need to be active to use
it.
I made it so that if ParallelRefProcEnabled is true, and
ReferencesPerThread > 0, we enable the dynamic number of reference
processing threads.
I.e. if ParallelRefProcEnabled is false, dynamic number of reference
processing threads is also turned off (because the user requested to
not do parallel ref processing at all).

Thanks,
  Thomas