RFR: 8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal

Wed Sep 21 10:45:47 UTC 2022

On Wed, 14 Sep 2022 00:36:18 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

> 8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal  
> 8155996: Improve concurrent refinement green zone control
> 8134303: Introduce -XX:-G1UseConcRefinement
> 
> Please review this change to the control of concurrent refinement.
> 
> This new controller takes a different approach to the problem, addressing a
> number of issues.
> 
> The old controller used a multiple of the target number of cards to determine
> the range over which increasing numbers of refinement threads should be
> activated, and finally activating mutator refinement.  This has a variety of
> problems.  It doesn't account for the processing rate, the rate of new dirty
> cards, or the time available to perform the processing.  This often leads to
> unnecessary spikes in the number of running refinement threads.  It also tends
> to drive the pending number to the target quickly and keep it there, removing
> the benefit from having pending dirty cards filter out new cards for nearby
> writes.  It can't delay and leave excess cards in the queue because it could
> be a long time before another buffer is enqueued.
> 
> The old controller was triggered by mutator threads enqueing card buffers,
> when the number of cards in the queue exceeded a threshold near the target.
> This required a complex activation protocol between the mutators and the
> refinement threads.
> 
> With the new controller there is a primary refinement thread that periodically
> estimates how many refinement threads need to be running to reach the target
> in time for the next GC, along with whether to also activate mutator
> refinement.  If the primary thread stops running because it isn't currently
> needed, it sleeps for a period and reevaluates on wakeup.  This eliminates any
> involvement in the activation of refinement threads by mutator threads.
> 
> The estimate of how many refinement threads are needed uses a prediction of
> time until the next GC, the number of buffered cards, the predicted rate of
> new dirty cards, and the predicted refinement rate.  The number of running
> threads is adjusted based on these periodically performed estimates.
> 
> This new approach allows more dirty cards to be left in the queue until late
> in the mutator phase, typically reducing the rate of new dirty cards, which
> reduces the amount of concurrent refinement work needed.
> 
> It also smooths out the number of running refinement threads, eliminating the
> unnecessarily large spikes that are common with the old method.  One benefit
> is that the number of refinement threads (lazily) allocated is often much
> lower now.  (This plus UseDynamicNumberOfGCThreads mitigates the problem
> described in JDK-8153225.)
> 
> This change also provides a new method for calculating for the number of dirty
> cards that should be pending at the start of a GC. While this calculation is
> conceptually distinct from the thread control, the two were significanly
> intertwined in the old controller.  Changing this calculation separately and
> first would have changed the behavior of the old controller in ways that might
> have introduced regressions.  Changing it after the thread control was changed
> would have made it more difficult to test and measure the thread control in a
> desirable configuration.
> 
> The old calculation had various problems that are described in JDK-8155996.
> In particular, it can get more or less stuck at low values, and is slow to
> respond to changes.
> 
> The old controller provided a number of product options, none of which were
> very useful for real applications, and none of which are very applicable to
> the new controller.  All of these are being obsoleted.
> 
> -XX:-G1UseAdaptiveConcRefinement
> -XX:G1ConcRefinementGreenZone=<buffer-count>
> -XX:G1ConcRefinementYellowZone=<buffer-count>
> -XX:G1ConcRefinementRedZone=<buffer-count>
> -XX:G1ConcRefinementThresholdStep=<buffer-count>
> 
> The new controller *could* use G1ConcRefinementGreenZone to provide a fixed
> value for the target number of cards, though it is poorly named for that.
> 
> A configuration that was useful for some kinds of debugging and testing was to
> disable G1UseAdaptiveConcRefinement and set g1ConcRefinementGreenZone to a
> very large value, effectively disabling concurrent refinement.  To support
> this use case with the new controller, the -XX:-G1UseConcRefinement diagnostic
> option has been added (see JDK-8155996).
> 
> The other options are meaningless for the new controller.
> 
> Because of these option changes, a CSR and a release note need to accompany
> this change.
> 
> Testing:
> mach5 tier1-6
> various performance tests.
> local (linux-x64) tier1 with -XX:-G1UseConcRefinement
> 
> Performance testing found no regressions, but also little or no improvement
> with default options, which was expected.  With default options most of our
> performance tests do very little concurrent refinement.  And even for those
> that do, while the old controller had a number of problems, the impact of
> those problems is small and hard to measure for most applications.
> 
> When reducing G1RSetUpdatingPauseTimePercent the new controller seems to fare
> better, particularly when also reducing MaxGCPauseMillis.  specjbb2015 with
> MaxGCPauseMillis=75 and G1RSetUpdatingPauseTimePercent=3 (and other options
> held constant) showed a statistically significant improvement of about 4.5%
> for critical-jOPS.  Using the changed controller, the difference between this
> configuration and the default is fairly small, while the baseline shows
> significant degradation with the more restrictive options.
> 
> For all tests and configurations the new controller often creates many fewer
> refinement threads.

Took a first pass over the code and I really like the new approach. Feels much simpler to reason about and understand. 

I have a few small suggestions/questions.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 267:

> 265:     // If target is "unbounded" then wait forever (until explicitly
> 266:     // activated).  This happens during startup, when we don't bother with
> 267:     // refinement.

We use `SIZE_MAX` to signal that the target is "unbounded", like the comment says. We do this comparison a few times and I think it would make sense to define a constant with a more descriptive name for this.

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 174:

> 172:     if (!try_refinement_step(cr()->pending_cards_target())) {
> 173:       // Proceed with fewer threads if target reached.
> 174:       cr()->reduce_threads_wanted();

Not sure I see the reason to never try refinement when adjusted. I get that there is a case where we don't need to do refinement but that seems to be handled by an "early return" in `try_refinement_step(...)`. 

If we really want to skip refinement I would prefer an early return here as well. 

if (cr()->adjust_threads_periodically()) {
  // If adjustment done, don't do any refinement this round, since this thread
  // might no longer be wanted active.
  return;
}
...

src/hotspot/share/gc/g1/g1ConcurrentRefineThreadsNeeded.hpp line 40:

> 38:   G1Policy* _policy;
> 39:   double _update_period_ms;
> 40:   double _predicted_time_ms;

I like to keep the names short but I wonder if it would make sense to include until gc here:  `_predicted_time_until_gc_ms`

src/hotspot/share/gc/g1/g1Policy.cpp line 1124:

> 1122:   uint remaining = target - MIN2(target, full);
> 1123:   size_t bytes = remaining * HeapRegion::GrainBytes;
> 1124:   return bytes - MIN2(bytes, allocator->used_in_alloc_regions());

>From what I can tell it's only this last line that require the `Heap_lock` to be held, so we could still do an estimate (slightly worse) even if we fail to take the lock. So my question is would it make sense to move the `Heap_lock->try_lock()` in here and just skip including `allocator->used_in_alloc_regions()` in the calculation when not getting the lock?

-------------

Changes requested by sjohanss (Reviewer).

PR: https://git.openjdk.org/jdk/pull/10256