Increased ScanRS time when decreasing G1RSetUpdatingPauseTimePercent

Mon Jan 20 14:17:16 UTC 2020

Hi Joakim,

On 19.01.20 12:02, Joakim Thun wrote:
> Hi all,
> 
> I would really appreciate some help understanding a G1 behaviour I am 
> seeing when decreasing the value of G1RSetUpdatingPauseTimePercent where 
> the goal is to decrease the time spent in the UpdateRS phase by moving 
> some of the work to be processed concurrently by the refinement threads.
> 
> The behaviour I was expecting to see was a decrease in UpdateRS time 
> which I am seeing but at the expense of more time being spent in the 
> ScanRS phase so the end result i.e. the total pause time end up being 
> very similar with and without the flag set. Decreasing 
> G1RSetUpdatingPauseTimePercent to both 5 and 1 results in similar 
> behaviour. I noticed that the number of scanned cards is much higher in 
> the ScanRS phase when decreasing G1RSetUpdatingPauseTimePercent.
> 
> Is this expected behaviour?
> 

TLDR: yes.

Longer version:

The purpose of the refinement threads and the refinement queues (which 
are processed during Update RS) is to update the remembered sets (whose 
scanning is attributed to the Scan RS time) after some filtering (is 
that card already in a remembered set? Can we drop it for other reasons?).

If an entry/card in the refinement queues has not been processed before 
GC, it must be processed during GC (though not all of the filtering 
needs to be applied there).

What is cheaper to do during GC, scanning remembered sets or refinement 
queues? That depends on the contents of the card. If it contains 
references to a lot of regions in the collection set, then it is 
probably cheaper to let it stay in the refinement queue. If it does not 
contain a reference to any region in the collection set, then putting it 
into the remembered sets is a win because we move otherwise unnecessary 
work out of the pause.

There are a lot of different arguments about what the optimal location 
for a card should be; some of these decisions have impact outside of the 
gc pause too.
E.g. a card in the refinement queue not yet processed is never 
re-enqueued - this saves enqueuing and processing work at mutator time; 
however, given that such cards may not contain any reference into the 
collection set (which you only know once you process them), keeping them 
in the queue would make the pause time slightly longer.
As long as the card in the refinement buffer contains a reference to the 
collection set, G1 would scan it anyway (it would be in some remembered 
set), and retrieving values from the refinement queue during gc is (very 
slightly) faster than from the remembered sets.
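
To make the trade-off concrete, here is a toy cost model in Java. It is 
not G1 code: the class name, the unit costs and the assumption that a 
refined card costs one remembered-set entry per collection-set region it 
points into are all mine, chosen only to illustrate the argument above.

```java
// Toy model only: costs are made-up units, not measurements from G1.
final class CardPlacementCost {
    // Assumed pause cost of scanning one card left in a refinement queue.
    static final int QUEUE_SCAN_COST = 1;
    // Assumed pause cost of visiting one remembered-set entry for the card.
    static final int REMSET_ENTRY_COST = 1;

    /** A card still in the queue is always looked at during the pause. */
    static int costIfLeftInQueue() {
        return QUEUE_SCAN_COST;
    }

    /** A refined card costs one entry per collection-set region it references. */
    static int costIfRefined(int csetRegionsReferenced) {
        return csetRegionsReferenced * REMSET_ENTRY_COST;
    }

    /** True if refining the card before the pause strictly reduces pause work. */
    static boolean refiningReducesPause(int csetRegionsReferenced) {
        return costIfRefined(csetRegionsReferenced) < costIfLeftInQueue();
    }

    public static void main(String[] args) {
        for (int regions : new int[] {0, 1, 8}) {
            System.out.println(regions + " cset regions: queue=" + costIfLeftInQueue()
                    + " refined=" + costIfRefined(regions));
        }
    }
}
```

In this model only a card with no reference into the collection set is 
cheaper to refine ahead of the pause; for every other card refinement 
just moves the work from one phase to the other.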

Overall there is no rule that "Update RS" work is bad while "Scan RS" isn't.

In your case, since you are trading Update RS time for Scan RS time, I 
would argue that it's better to have the cards in the refinement queue.

> Are there any other flags worth considering to improve the ScanRS time 
> while moving more work to the refinement threads?

One could try to control refinement work by manually setting the various 
thresholds. No guarantees that this improves your situation.

Logging "gc+ergo+refine=debug" may help with debugging the adaptive 
refinement thresholds; gc+remset=trace gives some general information 
about concurrent refinement.

Some rundown on the options:

G1UseAdaptiveConcRefinement: enable adaptive refinement, i.e. try to 
observe G1RSetUpdatingPauseTimePercent.

G1UpdateBufferSize (default 256): size of a buffer in the refinement 
queue, i.e. individual threads will cache that amount of cards to 
process later until they are made available to the refinement threads.
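
The buffering described for G1UpdateBufferSize can be sketched roughly 
as below. This is a single-threaded illustration in Java, not the 
HotSpot implementation; the class and method names are hypothetical and 
cards are represented as plain long values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of one thread's dirty card buffer.
final class DirtyCardBufferSketch {
    private final int bufferSize;               // plays the role of G1UpdateBufferSize
    private final Queue<List<Long>> completed;  // buffers visible to refinement threads
    private List<Long> current = new ArrayList<>();

    DirtyCardBufferSketch(int bufferSize, Queue<List<Long>> completed) {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be at least 1");
        }
        this.bufferSize = bufferSize;
        this.completed = completed;
    }

    /** Cache the card locally; hand the buffer over once it is full. */
    void enqueue(long card) {
        current.add(card);
        if (current.size() == bufferSize) {
            completed.add(current);
            current = new ArrayList<>();
        }
    }

    /** Cards cached by this thread that refinement threads cannot see yet. */
    int pendingLocally() {
        return current.size();
    }

    public static void main(String[] args) {
        Queue<List<Long>> completed = new ArrayDeque<>();
        DirtyCardBufferSketch buffer = new DirtyCardBufferSketch(256, completed);
        for (long card = 0; card < 600; card++) {
            buffer.enqueue(card);
        }
        System.out.println("completed buffers: " + completed.size()
                + ", still local: " + buffer.pendingLocally());
    }
}
```

With a buffer size of 1 every enqueued card is handed over right away, 
which is why a very small G1UpdateBufferSize shifts work towards 
immediate processing.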

G1ConcRefinementGreenZone, G1ConcRefinementYellowZone, 
G1ConcRefinementRedZone: some thresholds that control the refinement 
threads. If the number of buffers (see above) is lower than the green 
threshold, there is no concurrent refinement activity. From the green to 
the yellow threshold, increasingly more concurrent refinement threads 
will be used. If the number of buffers reaches the red threshold, 
mutator threads will do the work.

If G1UseAdaptiveConcRefinement is enabled, the thresholds are changed 
adaptively, and the ones you give on the command line are initial 
values. Otherwise the thresholds are fixed.
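
As a rough illustration of how the three zones interact, here is a small 
Java sketch. It is not the HotSpot code: the class is hypothetical, and 
the linear ramp of threads between green and yellow is my assumption 
about what "increasingly more" could look like.

```java
// Hypothetical model of the green/yellow/red refinement zones.
final class RefinementZonesSketch {
    enum Action { IDLE, CONCURRENT_REFINEMENT, MUTATOR_REFINEMENT }

    private final int green, yellow, red, maxThreads;

    RefinementZonesSketch(int green, int yellow, int red, int maxThreads) {
        if (green < 0 || green > yellow || yellow > red || maxThreads < 0) {
            throw new IllegalArgumentException("need 0 <= green <= yellow <= red, maxThreads >= 0");
        }
        this.green = green;
        this.yellow = yellow;
        this.red = red;
        this.maxThreads = maxThreads;
    }

    /** Who works on the queue for a given number of completed buffers. */
    Action actionFor(int completedBuffers) {
        if (completedBuffers >= red) return Action.MUTATOR_REFINEMENT;
        if (completedBuffers < green) return Action.IDLE;
        return Action.CONCURRENT_REFINEMENT;
    }

    /** Assumed linear ramp: 1 thread at green, all threads from yellow on. */
    int activeThreads(int completedBuffers) {
        if (completedBuffers < green || maxThreads == 0) return 0;
        if (completedBuffers >= yellow) return maxThreads;
        long above = completedBuffers - green;   // here green <= buffers < yellow
        return 1 + (int) (above * (maxThreads - 1) / (yellow - green));
    }

    public static void main(String[] args) {
        RefinementZonesSketch zones = new RefinementZonesSketch(10, 20, 40, 4);
        for (int buffers : new int[] {5, 10, 15, 20, 40}) {
            System.out.println(buffers + " buffers: " + zones.actionFor(buffers)
                    + ", threads=" + zones.activeThreads(buffers));
        }
    }
}
```

Setting all three thresholds to 0 in this model makes every non-negative 
buffer count land in the red zone, i.e. the mutators do the work.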

G1ConcRefinementThreads: max number of refinement threads.

So you could completely disable concurrent refinement by disabling 
G1UseAdaptiveConcRefinement and setting G1ConcRefinementThreads=0; if 
you also set the red threshold to 0, the mutators will do all the work 
themselves. If you additionally set G1UpdateBufferSize to 1, the 
mutators will, I think, do that work immediately (this will likely have 
a significant impact on mutator performance).

Otherwise, using the thresholds, you can select the amount of concurrent 
refinement work in a very granular way.
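
For experimenting with fixed thresholds, a small helper that assembles 
the command line may be handy. The Java sketch below only builds 
strings from the options mentioned above plus the logging selectors; the 
class name, the threshold values 8/16/32 and app.jar are placeholders, 
and it assumes a JDK release that still accepts these options.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds JVM arguments for fixed refinement thresholds.
final class RefinementFlagsSketch {
    static List<String> fixedThresholdFlags(int green, int yellow, int red) {
        List<String> flags = new ArrayList<>();
        flags.add("-XX:+UseG1GC");
        // Turn off adaptive sizing so the thresholds below stay fixed.
        flags.add("-XX:-G1UseAdaptiveConcRefinement");
        flags.add("-XX:G1ConcRefinementGreenZone=" + green);
        flags.add("-XX:G1ConcRefinementYellowZone=" + yellow);
        flags.add("-XX:G1ConcRefinementRedZone=" + red);
        // Logging for the refinement thresholds and remembered sets.
        flags.add("-Xlog:gc+ergo+refine=debug,gc+remset=trace");
        return flags;
    }

    public static void main(String[] args) {
        // 8/16/32 and app.jar are arbitrary placeholder values.
        System.out.println("java " + String.join(" ", fixedThresholdFlags(8, 16, 32))
                + " -jar app.jar");
    }
}
```

The values themselves need to be found by measurement for the 
application at hand.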

Thanks,
   Thomas