RFR: 8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal [v3]

Stefan Johansson sjohanss at openjdk.org
Tue Sep 27 14:03:30 UTC 2022


On Fri, 23 Sep 2022 06:39:33 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

>> 8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal  
>> 8155996: Improve concurrent refinement green zone control
>> 8134303: Introduce -XX:-G1UseConcRefinement
>> 
>> Please review this change to the control of concurrent refinement.
>> 
>> This new controller takes a different approach to the problem, addressing a
>> number of issues.
>> 
>> The old controller used a multiple of the target number of cards to determine
>> the range over which increasing numbers of refinement threads should be
>> activated, and finally activating mutator refinement.  This has a variety of
>> problems.  It doesn't account for the processing rate, the rate of new dirty
>> cards, or the time available to perform the processing.  This often leads to
>> unnecessary spikes in the number of running refinement threads.  It also tends
>> to drive the pending number to the target quickly and keep it there, removing
>> the benefit from having pending dirty cards filter out new cards for nearby
>> writes.  It can't delay and leave excess cards in the queue because it could
>> be a long time before another buffer is enqueued.
>> 
>> The old controller was triggered by mutator threads enqueing card buffers,
>> when the number of cards in the queue exceeded a threshold near the target.
>> This required a complex activation protocol between the mutators and the
>> refinement threads.
>> 
>> With the new controller there is a primary refinement thread that periodically
>> estimates how many refinement threads need to be running to reach the target
>> in time for the next GC, along with whether to also activate mutator
>> refinement.  If the primary thread stops running because it isn't currently
>> needed, it sleeps for a period and reevaluates on wakeup.  This eliminates any
>> involvement in the activation of refinement threads by mutator threads.
>> 
>> The estimate of how many refinement threads are needed uses a prediction of
>> time until the next GC, the number of buffered cards, the predicted rate of
>> new dirty cards, and the predicted refinement rate.  The number of running
>> threads is adjusted based on these periodically performed estimates.
>> 
>> This new approach allows more dirty cards to be left in the queue until late
>> in the mutator phase, typically reducing the rate of new dirty cards, which
>> reduces the amount of concurrent refinement work needed.
>> 
>> It also smooths out the number of running refinement threads, eliminating the
>> unnecessarily large spikes that are common with the old method.  One benefit
>> is that the number of refinement threads (lazily) allocated is often much
>> lower now.  (This plus UseDynamicNumberOfGCThreads mitigates the problem
>> described in JDK-8153225.)
>> 
>> This change also provides a new method for calculating for the number of dirty
>> cards that should be pending at the start of a GC. While this calculation is
>> conceptually distinct from the thread control, the two were significanly
>> intertwined in the old controller.  Changing this calculation separately and
>> first would have changed the behavior of the old controller in ways that might
>> have introduced regressions.  Changing it after the thread control was changed
>> would have made it more difficult to test and measure the thread control in a
>> desirable configuration.
>> 
>> The old calculation had various problems that are described in JDK-8155996.
>> In particular, it can get more or less stuck at low values, and is slow to
>> respond to changes.
>> 
>> The old controller provided a number of product options, none of which were
>> very useful for real applications, and none of which are very applicable to
>> the new controller.  All of these are being obsoleted.
>> 
>> -XX:-G1UseAdaptiveConcRefinement
>> -XX:G1ConcRefinementGreenZone=<buffer-count>
>> -XX:G1ConcRefinementYellowZone=<buffer-count>
>> -XX:G1ConcRefinementRedZone=<buffer-count>
>> -XX:G1ConcRefinementThresholdStep=<buffer-count>
>> 
>> The new controller *could* use G1ConcRefinementGreenZone to provide a fixed
>> value for the target number of cards, though it is poorly named for that.
>> 
>> A configuration that was useful for some kinds of debugging and testing was to
>> disable G1UseAdaptiveConcRefinement and set g1ConcRefinementGreenZone to a
>> very large value, effectively disabling concurrent refinement.  To support
>> this use case with the new controller, the -XX:-G1UseConcRefinement diagnostic
>> option has been added (see JDK-8155996).
>> 
>> The other options are meaningless for the new controller.
>> 
>> Because of these option changes, a CSR and a release note need to accompany
>> this change.
>> 
>> Testing:
>> mach5 tier1-6
>> various performance tests.
>> local (linux-x64) tier1 with -XX:-G1UseConcRefinement
>> 
>> Performance testing found no regressions, but also little or no improvement
>> with default options, which was expected.  With default options most of our
>> performance tests do very little concurrent refinement.  And even for those
>> that do, while the old controller had a number of problems, the impact of
>> those problems is small and hard to measure for most applications.
>> 
>> When reducing G1RSetUpdatingPauseTimePercent the new controller seems to fare
>> better, particularly when also reducing MaxGCPauseMillis.  specjbb2015 with
>> MaxGCPauseMillis=75 and G1RSetUpdatingPauseTimePercent=3 (and other options
>> held constant) showed a statistically significant improvement of about 4.5%
>> for critical-jOPS.  Using the changed controller, the difference between this
>> configuration and the default is fairly small, while the baseline shows
>> significant degradation with the more restrictive options.
>> 
>> For all tests and configurations the new controller often creates many fewer
>> refinement threads.
>
> Kim Barrett has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - wanted vs needed nomenclature
>  - remove several spurious "scan"
>  - delay => wait_time_ms

Did some testing with the patch to get a feel for the log changes and found a couple of things that we might want to improved.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 345:

> 343:                         num_cards,
> 344:                         _threads_needed.predicted_cards_at_next_gc(),
> 345:                         _threads_needed.predicted_time_until_next_gc_ms());

During my testing with fixed IR jbb I saw this kind of frequently: 

[19,856s][trace][gc,refine      ] Activated worker 0, current: 79255
[19,856s][debug][gc,refine      ] Updating refinement threads: wanted 0, cards: 79255, predicted: 79255, time: 0,00ms
[19,856s][trace][gc,refine      ] Deactivated worker 0, cards: 79255, refined 0, rate 0,00c/ms
[19,906s][trace][gc,refine      ] Activated worker 0, current: 79511
[19,906s][debug][gc,refine      ] Updating refinement threads: wanted 0, cards: 79511, predicted: 79511, time: 0,00ms
[19,906s][trace][gc,refine      ] Deactivated worker 0, cards: 79511, refined 0, rate 0,00c/ms
[19,956s][trace][gc,refine      ] Activated worker 0, current: 79511
[19,956s][debug][gc,refine      ] Updating refinement threads: wanted 0, cards: 79511, predicted: 79511, time: 0,00ms
[19,956s][trace][gc,refine      ] Deactivated worker 0, cards: 79511, refined 0, rate 0,00c/ms
[20,007s][trace][gc,refine      ] Activated worker 0, current: 79511
[20,007s][debug][gc,refine      ] Updating refinement threads: wanted 0, cards: 79511, predicted: 79511, time: 0,00ms
[20,007s][trace][gc,refine      ] Deactivated worker 0, cards: 79511, refined 0, rate 0,00c/ms
[20,057s][trace][gc,refine      ] Activated worker 0, current: 79767
[20,057s][debug][gc,refine      ] Updating refinement threads: wanted 0, cards: 79767, predicted: 79767, time: 0,00ms
[20,057s][trace][gc,refine      ] Deactivated worker 0, cards: 79767, refined 0, rate 0,00c/ms
[20,107s][trace][gc,refine      ] Activated worker 0, current: 79767
[20,107s][debug][gc,refine      ] Updating refinement threads: wanted 0, cards: 79767, predicted: 79767, time: 0,00ms
[20,107s][trace][gc,refine      ] Deactivated worker 0, cards: 79767, refined 0, rate 0,00c/ms


>From what I can tell, in those cases we are not really updating the refinement threads, but keeping them at 0. I think this log should only be printed under the condition that `new_wanted` is different from `old_wanted`. Makes sense?

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 61:

> 59:     SuspendibleThreadSetJoiner sts_join;
> 60:     G1ConcurrentRefineStats active_stats_start = _refinement_stats;
> 61:     report_active("Activated");

Because `G1ConcurrentRefineThread::run_service()` is shared between the primary and secondary thread there will be an "Activated" log message for each periodic wake of the primary refinement thread. This is on the trace level, but I think we should only call `report_active()` when we know there is work to be done by the thread, not just checks.

src/hotspot/share/gc/g1/g1Policy.cpp line 963:

> 961:             hcc_cards,
> 962:             merge_hcc_time_ms,
> 963:             (exceeded_goal ? " (exceeded goal)" : ""));

Since the pending cards target is initialized to `SIZE_MAX`, the first GC this will always print: 

GC(0) GC refinement: goal: 18446744073709551615 ...


Not sure if printing 0 is much better but it looks more "buggy" to print `SIZE_MAX` imo.

-------------

Changes requested by sjohanss (Reviewer).

PR: https://git.openjdk.org/jdk/pull/10256


More information about the hotspot-dev mailing list