Re: RFR: 8236073: G1: Use SoftMaxHeapSize to guide GC heuristics

Liang Mao maoliang.ml at alibaba-inc.com
Wed Jan 15 03:52:02 UTC 2020


Hi Thomas,

I summarize the issues as follows:

1. Criterion of SoftMaxHeapSize
I agree to keep the SoftMaxHeapSize policy similar to ZGC's so the behavior
is unified. So "expand_heap_after_young_collection" is used to meet the
basic GCTimeRatio and expands the heap immediately, which must not be blocked
for any reason. "adjust_heap_after_young_collection" will not change that
logic, and I will take both expansion and shrinking into consideration. Is my
understanding correct here?

2. Full GC with SoftMaxHeapSize
In my view a non-explicit Full GC probably indicates insufficient heap
capacity, so we may not want to keep shrinking to within SoftMaxHeapSize,
while an explicit Full GC doesn't have that issue. That's the only reason I
checked whether the GC was explicit. But since we use the same logic to
determine whether the heap can be shrunk, the "explicit" check is
meaningless and I will remove it.

3. SoftMaxHeapSizeConstraintFunc doesn't check Xms
The constraint function doesn't make sure that SoftMaxHeapSize is not less
than Xms. Do we need to add that check? It would affect more than just G1...
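For illustration, the missing range check might look roughly like this (a minimal sketch; the variable and function names are stand-ins for the real flag machinery, not the actual SoftMaxHeapSizeConstraintFunc):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the JVM flag values (sizes in bytes here).
static uint64_t MinHeapSize = 512;   // -Xms
static uint64_t MaxHeapSize = 4096;  // -Xmx

// Per JDK-8222181, SoftMaxHeapSize must not be smaller than -Xms
// nor greater than -Xmx, i.e. it must lie in [MinHeapSize, MaxHeapSize].
bool soft_max_heap_size_ok(uint64_t soft_max) {
  return soft_max >= MinHeapSize && soft_max <= MaxHeapSize;
}
```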

4. commit/uncommit parallelism
The concurrent uncommit can run while the VMThread is doing a GC, and the GC
may request to expand the heap if there are not enough empty regions. So such
parallelism is possible, and immediate uncommit is one solution.

5. More heap expansion/shrink heuristics
We have some data and experience with dynamic heap adjustment in our workloads.
The default GCTimeRatio of 12 is a well-tuned number: we found that applications
show obvious timeout errors if it drops below ~12. So it is a kind of *hard*
limit, and we need to expand immediately if GCTimeRatio drops below 12. The
difference in our workloads is that we keep GCTimeRatio near the original
value of 99 to keep GC in a healthy state, because the allocation rate and
outside input can vary so violently that we don't want frequent adjustment.
You know that in our 8u implementation we just keep a conservative GC interval
to achieve that. Compared to the current code in JDK 15, keeping GCTimeRatio
at 99 is a different behavior which might have a larger memory footprint. I
propose that we could still use the original option "-XX:+G1ElasticHeap" to
keep the GCTimeRatio around 99 or a specified number. The default flow will
make sure GCTimeRatio stays above the threshold of 12, and concurrent
commit/uncommit will adjust the heap to keep GCTimeRatio at a proper number,
so that the adjustment is never urgent.
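The two-level policy described above could be sketched as follows (the enum, function name, and thresholds are illustrative only, not actual G1 code):

```cpp
#include <cassert>

enum class ResizeAction { ExpandNow, ConcurrentAdjust, None };

// Sketch of the two-level GCTimeRatio policy:
// - below the hard floor (default 12), expand immediately at the safepoint,
//   and this must not be blocked;
// - between the floor and the target (e.g. 99), let the concurrent
//   commit/uncommit thread drift the heap toward the target, since the
//   adjustment is not urgent;
// - at or above the target, GC is in a healthy state.
ResizeAction choose_resize_action(double actual_ratio,
                                  double hard_floor,    // e.g. 12
                                  double target_ratio)  // e.g. 99
{
  if (actual_ratio < hard_floor) {
    return ResizeAction::ExpandNow;        // urgent, immediate expansion
  }
  if (actual_ratio < target_ratio) {
    return ResizeAction::ConcurrentAdjust; // non-urgent concurrent resize
  }
  return ResizeAction::None;               // healthy, leave the heap alone
}
```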

Thanks,
Liang





------------------------------------------------------------------
From:Thomas Schatzl <thomas.schatzl at oracle.com>
Send Time:2020 Jan. 14 (Tue.) 19:36
To:"MAO, Liang" <maoliang.ml at alibaba-inc.com>; hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>
Subject:Re: RFR: 8236073: G1: Use SoftMaxHeapSize to guide GC heuristics

Hi,

On 14.01.20 10:07, Liang Mao wrote:
> Hi Thomas,
> 
> Thank you for the detailed comments!
> Most of suggestions I will follow to do the modification.
> And I still have some questions:
> 
>>> 1. Does the SoftMaxHeapSize limitation need to consider the GC time 
>>> ratio as in
>>> expand_heap_after_young_collection? Now we haven't put the logic in yet.
> 
>> I am not completely clear what you are asking about, but the gc time 
>> ratio only affects the current "optimal" heap size which is bounded by 
>> SoftMaxHeapSize/max_capacity.
> 
> The decision to shrink to SoftMaxHeapSize in this patch is based
> on the method "G1HeapSizingPolicy::can_shrink_heap_size_to", which
> counts "used" + "reserve" + "young". We will change it to
> _heap_sizing_policy->shrink_amount() as you commented.
> I'm not considering the GC time ratio as a factor in determining
> whether the heap can be shrunk to SoftMaxHeapSize.
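The shrink criterion quoted above could be sketched as follows (a simplified stand-in for G1HeapSizingPolicy::can_shrink_heap_size_to, with all sizes in bytes; not the exact G1 code):

```cpp
#include <cassert>
#include <cstdint>

// The heap can only shrink to `target` if `target` still covers the used
// memory, the heap reserve, and the young generation, i.e. the
// "used" + "reserve" + "young" sum mentioned in the quoted paragraph.
bool can_shrink_heap_size_to(uint64_t target,
                             uint64_t used,
                             uint64_t reserve,
                             uint64_t young) {
  return target >= used + reserve + young;
}
```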

Going back to the "spec":

"When -XX:SoftMaxHeapSize is set, the GC should strive to not grow heap 
size beyond the specified size, unless the GC decides it's necessary to 
do so. The soft max heap size should not be allowed to be set to a value 
smaller than min heap size (-Xms) or greater than max heap size (-Xmx). 
When not set on the command-line, this flag should default to the max 
heap size." (https://bugs.openjdk.java.net/browse/JDK-8222181)

This is a very loose definition, and "unless the GC decides it's 
necessary" may mean anything.

Looking at ZGC code, it mostly uses it to drive the GCs (and determine 
expansion amount), and let regular expansion/shrinking do the work 
without new rules.

I would at first tend to do the same: if existing heap policy indicates 
that an expansion at young gc (which takes GCTimeRatio into account) is 
needed for whatever reason, I would actually let G1 keep doing it; 
conversely I would also take GCTimeRatio into account when trying to 
shrink to keep the policy symmetric.

The current implementation probably preferentially shrinks towards
SoftMaxHeapSize, correct? (This shrinking also seems to be limited to
exactly SoftMaxHeapSize - why not go below that if the collector thinks
it would be okay?)

Do you have measurements with and without GCTimeRatio in the shrinking? Can
you show, with over-time heap occupancy graphs, that this does not work at
all in your workloads?

Measurements of committed heap over time of the current solution would 
be appreciated too.

(I haven't had the time yet to set up some useful testing application 
that can be used to simulate phases of such a workload to show heap 
shrinking but I assume you have some data.)

>> - changes in G1CollectedHeap::resize_heap_if_necessary: please make the 
>> method to always use the soft max heap size as I do not understand why 
>> you would not do that.
> 
> Do you think we need to apply the logic "can_shrink_heap_size_to"
> inside resize_heap_if_necessary to determine whether to make soft
> max size as limit?

resize_heap_if_necessary() is the equivalent of
adjust_heap_after_young_collection() for after a full gc (the naming could
be unified at some point).

What the change implements right now is to respect SoftMaxHeapSize only 
on an explicit gc in resize_heap_if_necessary(), while always respecting 
it during young gcs.

Do you have a reason for this decision? This seems like an inconsistency 
I can not find a reason for.

As mentioned above, I would try to keep SoftMaxHeapSize only a goal for 
starting (concurrent) garbage collections, with "natural" sizing trying 
to keep the SoftMaxHeapSize goal.

Particularly, if a (compacting) full gc won't meet the SoftMaxHeapSize, 
what else will? It is indeed unfortunate that you might need to tweak 
Min/MaxFreeRatio to achieve higher uncommit ratio at full gc...

This change (System.gc() specifically trying to meet SoftMaxHeapSize) also
seems to be an artifact of your usage of this feature - maybe you happen
to always issue a System.gc() after you change SoftMaxHeapSize?
It would probably be better if a concurrent cycle were triggered
automatically, similar to periodic gcs.

>>  - there is a bug in the synchronization of the concurrent uncommit 
>> thread: it seems possible that the uncommit thread is still working 
>> (iterating over the list of regions to uncommit) while a completed 
>> (young) GC may add new regions to that list as young gcs do not wait for 
>> completion of the uncommit thread.
> 
> Uncommit thread could be working parallelly with VMThread but
> VMThread will not add regions to the concurrent_resizing_list
> if concurrent resizing thread is in "working" state.

Okay, I now see the check in
G1CollectedHeap::adjust_heap_after_young_collection(), but that also
prohibits expansion during young GC, which seems relatively disruptive. I
think I mentioned earlier that in this case (if we want to expand) it
might be better to wait for completion of the parallel uncommit _if_ the
other remaining regions are not enough.

(Alternatively one could add another uncommit/commit request to a 
hypothetical uncommit task queue for that resize thread).

My thinking is that at worst, this would result in the same behavior as
before (i.e. blocking because of a commit/uncommit in progress) instead of
changing the overall behavior by denying expansion requests (silently,
which is pretty bad). Denying them could cause the heap sizing policy to
get a few samples with a high gc time ratio, expanding even more once G1
is finally allowed to expand. (Also, fortunately UseGCOverheadLimit is not
implemented in G1, as it might kill the VM spuriously because of that...)
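The "wait for the uncommit to finish if the remaining regions are not enough" idea could look roughly like this (a minimal sketch using standard C++ threading primitives rather than HotSpot's Monitor API; all names are illustrative):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// If a young GC wants to expand while the concurrent uncommit is still
// running and the remaining free regions are not enough, block until the
// uncommit cycle completes instead of silently denying the expansion.
class ResizeCoordinator {
  std::mutex _lock;
  std::condition_variable _cv;
  bool _uncommit_in_progress = false;

public:
  void begin_uncommit() {
    std::lock_guard<std::mutex> g(_lock);
    _uncommit_in_progress = true;
  }

  void end_uncommit() {
    { std::lock_guard<std::mutex> g(_lock);
      _uncommit_in_progress = false; }
    _cv.notify_all();  // wake any GC thread waiting to expand
  }

  bool uncommit_in_progress() {
    std::lock_guard<std::mutex> g(_lock);
    return _uncommit_in_progress;
  }

  // Called by the GC when it wants to expand the heap.
  void wait_for_uncommit_if_needed(bool enough_free_regions) {
    if (enough_free_regions) return;  // fast path: no need to wait
    std::unique_lock<std::mutex> g(_lock);
    _cv.wait(g, [this] { return !_uncommit_in_progress; });
  }
};
```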

Or do you have some reason for this particular implementation?

(The reason I am asking so much about details is to get a common 
understanding on the solution, i.e. what should happen when, and 
hopefully including why for the next guy ;))

Maybe it would be good to refactor the code in
G1CollectedHeap::adjust_heap_after_young_collection() (covering both the
shrink and expand paths), e.g. calculate some heap_size_change() value
from the heap sizing policy, and then use that value, depending on whether
it is positive or negative, to expand or shrink.
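The suggested refactoring might be sketched like this (heap_size_change() and the surrounding structure are hypothetical, with toy size bookkeeping standing in for the real policy):

```cpp
#include <cassert>
#include <cstdint>

// The sizing policy computes one signed delta; the caller interprets
// its sign instead of having separate expand/shrink decision paths.
struct HeapSizingPolicy {
  // Positive: bytes to expand; negative: bytes to shrink; zero: no change.
  int64_t heap_size_change(uint64_t optimal, uint64_t committed) const {
    return (int64_t)optimal - (int64_t)committed;
  }
};

void adjust_heap_after_young_collection(const HeapSizingPolicy& policy,
                                        uint64_t optimal,
                                        uint64_t& committed) {
  int64_t delta = policy.heap_size_change(optimal, committed);
  if (delta > 0) {
    committed += (uint64_t)delta;     // expand by the positive delta
  } else if (delta < 0) {
    committed -= (uint64_t)(-delta);  // shrink by the magnitude
  }                                   // delta == 0: nothing to do
}
```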

> 
>> Related to that is the use of par_set/clear_bit in e.g. the available 
>> bitmap: since all par/clear actions are asserted to be in the vm thread 
>> at a safepoint, there does not seem to be a need for using the parallel 
>> variants of set/clear bit (if keeping the current mechanism).
> 
> For the above reason that the concurrent uncommit can run in parallel
> with the VMThread, the bit set/clear in the VM thread at a safepoint
> has to use the parallel variants.

If concurrent uncommit is working, both heap shrinking and expansion in 
the safepoint is disabled as far as I can tell. I.e.

void G1CollectedHeap::adjust_heap_after_young_collection() {
  if (concurrent_heap_resize()->resize_thread()->during_cycle()) {
    // Do not adjust if concurrent resizing is in progress
    return;
  }

  double start_time_ms = os::elapsedTime();
  shrink_heap_after_young_collection();
  phase_times()->recor...[..]

  // Shrinking might have started resizing
  if (!concurrent_heap_resize()->resize_thread()->during_cycle()) {
    expand_heap_after_young_collection();
  }
}

This method seems to be the only caller to 
expand/shrink_heap_after_young_collection().

Another question I had but forgot to ask: the thread name and other
places use "resize" in method and class names instead of e.g. "uncommit"
or "shrink". Do you plan to add concurrent commit too?

>>  - I have a feeling that if the concurrent uncommit thread worked on 
>> pages, not regions, the code would be easier to understand. It would 
>> also solve the issue you asked about with the 
>> G1RegionsSmallerThanCommitSizeMapper. You may still need to pass region 
>> numbers anyway for logging, but otoh the logging could be done 
>> asynchroniusly.
> 
> I don't quite understand this part... For
> the G1RegionsSmallerThanCommitSizeMapper, a page can simultaneously be
> requested to commit by the VMThread to expand the heap and to uncommit
> by the concurrent thread to shrink the heap. It looks like lowering the
> uncommit work to the page level wouldn't help with this...

I see the problem, but as above I think expansion (commit) and shrinking
(uncommit) can't happen at the same time at the moment, and it might
actually be premature to allow expansion and shrinking to occur at the
same time.

One more option is serializing these requests, as I described above: if a
request is pending, wait for its completion before starting a new one.
This might be the default option for other reasons anyway.

Another is certainly to simply not uncommit such pages concurrently;
maybe we can still uncommit them immediately?
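The fallback just mentioned could be expressed as a simple routing decision (a sketch only; the function name and the page-size handling are illustrative, not actual G1 code):

```cpp
#include <cassert>
#include <cstdint>

// A region smaller than an OS commit page (the
// G1RegionsSmallerThanCommitSizeMapper case) shares its page with
// neighboring regions, so it is uncommitted immediately at the safepoint
// to avoid racing with a concurrent commit of the same page; regions
// covering whole pages can safely go to the concurrent uncommit list.
bool uncommit_concurrently(uint64_t region_bytes, uint64_t page_bytes) {
  return region_bytes >= page_bytes;
}
```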

> For the features you listed below,
>     1) moving the memory uncommit into the concurrent phase
>     2) uncommit at the end of (almost) every GC
>     3) SoftMaxHeapSize
> 
> Since most of the code is for the concurrent framework,
> do you think 2) and 3) can be implemented together first?
> (The uncommit will happen immediately)
I think that would be fine.


Thanks,
   Thomas


