Re: RFR: 8236073: G1: Use SoftMaxHeapSize to guide GC heuristics

Tue Jan 14 09:07:50 UTC 2020

Hi Thomas,

Thank you for the detailed comments!
Most of suggestions I will follow to do the modification. 
And I still have some questions:

>> 1. Does the SoftMaxHeapSize limitation need to consider the GC time 
>> ratio as in
>> expand_heap_after_young_collection? Now we haven't put the logic in yet.

> I am not completely clear what you are asking about, but the gc time 
> ratio only affects the current "optimal" heap size which is bounded by 
> SoftMaxHeapsize/max_capacity.

The decision to shrink to SoftMaxHeapSize in this patch is based 
on the method "G1HeapSizingPolicy::can_shrink_heap_size_to" which
counts "used" + "reserve" + "young". We will change it to 
_heap_sizing_policy->shrink_amount(); as you commented.
I'm not considering the GC time ratio as a factor to determine
whether the heap can be shinked to SoftMaxHeapSize.

> - changes in G1CollectedHeap::resize_heap_if_necessary: please make the 
> method to always use the soft max heap size as I do not understand why 
> you would not do that.

Do you think we need to apply the logic "can_shrink_heap_size_to"
inside resize_heap_if_necessary to determine whether to make soft
 max size as limit?

>  - there is a bug in the synchronization of the concurrent uncommit 
> thread: it seems possible that the uncommit thread is still working 
> (iterating over the list of regions to uncommit) while a completed 
> (young) GC may add new regions to that list as young gcs do not wait for 
> completion of the uncommit thread.

Uncommit thread could be working parallelly with VMThread but
VMThread will not add regions to the concurrent_resizing_list
if concurrent resizing thread is in "working" state.

> Related to that is the use of par_set/clear_bit in e.g. the available 
> bitmap: since all par/clear actions are asserted to be in the vm thread 
> at a safepoint, there does not seem to be a need for using the parallel 
> variants of set/clear bit (if keeping the current mechanism).

For above reason that concurrent uncommit can run parallely with
VMThread, the bit set/clear in vm thread at safepint have
to be parallel.

>  - I have a feeling that if the concurrent uncommit thread worked on 
> pages, not regions, the code would be easier to understand. It would 
> also solve the issue you asked about with the 
> G1RegionsSmallerThanCommitSizeMapper. You may still need to pass region 
> numbers anyway for logging, but otoh the logging could be done 
> asynchroniusly.

I don't quite understand this part... For the G1RegionsSmallerThanCommitSizeMapper,
a page can be simultaneously requested to commit in VMThread to expand
heap and uncommit in concurrent thread to shrink heap. Looks like lowering
uncommit work to page level couldn't help this...

For the features you listed below,
   1) moving the memory uncommit into the concurrent phase
   2) uncommit at the end of (almost) every GC
   3) SoftMaxHeapSize

Since most of code is for the concurrent framework,
do you think 2) and 3) can be together and implemented first?
(The uncommit will happen immediately)

Thanks,
Liang

------------------------------------------------------------------
From:Thomas Schatzl <thomas.schatzl at oracle.com>
Send Time:2020 Jan. 13 (Mon.) 23:15
To:"MAO, Liang" <maoliang.ml at alibaba-inc.com>; hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>
Subject:Re: RFR: 8236073: G1: Use SoftMaxHeapSize to guide GC heuristics

Hi Liang,

   thanks for your contribution!

I looked through the change a bit and have a few comments.

What I noticed quickly after initial browsing through it is that this 
change implements three different features:
   1) moving the memory uncommit into the concurrent phase
   2) uncommit at the end of (almost) every GC
   3) SoftMaxHeapSize

These should be split across three separate changes (I already filed 
JDK-8236926 last week). No particular order I think, but the concurrent 
uncommit changes are probably hardest and will probably take most time.

Some additional initial comments:

- in G1Collectedheap::check_soft_max_heap_size_changed(), instead of the 
check for "AllocateOldGenAt != NULL" we probably want to ask the 
HeapRegionManager directly about whether it supports this. Also print 
some log message that it failed. Even on success, print a log message :)

- in that same method, I recommend first doing the alignment adjustment 
(which probably needs to be done for that suggested soft_max_capacity() 
method below too) and then check if it changed. That saves the repeated 
!= _prev_soft_max_heap_size check.

Actually, just using the suggested soft_max_heap_size() method should be 
fine.

- changes in G1CollectedHeap::resize_heap_if_necessary: please make the 
method to always use the soft max heap size as I do not understand why 
you would not do that.
I recommend adding a "soft_max_capacity()" method in G1CollectedHeap, 
and let that return MIN2(align_up(SoftMaxHeapSize, heap_alignment), 
max_capacity()).

There are a few places that check SoftMaxHeapSize validity (e.g. 
SoftMaxHeapSize <= capacity()), they could probably all be removed then.

- doing timing: you might have noticed, we are currently transitioning 
to use Ticks/Tickspan for points in time and durations at least for the 
calculation, so in any new code please avoid using os::elapsedTime().

Use:

Ticks start = Ticks::now();
// timed code
phase_times()->record....((Ticks::now() - start).seconds() * 1000.0);

instead.

- in G1CollectedHeap::shrink_heap_after_young_collection() I would 
prefer a structure like in expand_heap_after_collection, i.e.:

size_t shrink_bytes = _heap_sizing_policy->shrink_amount();
if (shrink_bytes  > 0) {
   // do actual preparation for shrinking
}

and put all that logic determining the amount to shrink in that 
shrink_amount() method.

- concurrent uncommit:
  - as mentioned, please split the related changes out from the other 
changes. This change is hard enough to get right as is.

- I would really prefer if we did not need to introduce another helper 
thread for uncommitting memory. Did you try using the 
G1YoungRemSetSamplingThread? I understand that uncommit might then delay 
young gen sampling, but I do not expect these events to occur all the 
time (but I have no reference here).
In the first implementation we could have another thread if others do 
not object, but every additional thread takes some time to startup and 
teardown, and memory for at least one stack page.

- please move the change in G1CollectedHeap::abort_concurrent_cycle() 
into a separate method - waiting for completion of the concurrent 
uncommit and the concurrent marking are completely different concerns.

  - I admit I haven't looked at all cases in detail, but the split in 
is_available() and is_unavailable_for_allocation() in HeapRegionManager 
seems incomplete and unncessary. Particularly because of bad naming, as 
the documentation for is_available() says it's actually 
is_available[_for_allocation]. Disregarding the negation, these two look 
equivalent with the problem that !is_available() != is_unavailable..., 
which is really bad style.

I have not found a case where it is harmful to not consider the 
_concurrent_resizing_map in is_available().

The split the state of a region between two bitmaps in HeapRegionManager 
(the available_map and the _concurrent_resizing_map) may be susceptible 
to tricky races. Please consider changing this to a real "state" as in 
"Available -> Removing -> Unavailable". This would make the code easier 
to read too. (And in the future "Adding" if required).

  - it should be possible to disable concurrent uncommit/resize via an 
experimental flag. Also there should be no concurrent resize thread if 
the Heapregionmanager does not support it.
G1 couold immediately do the heap change in that case.

The reason for this flag is to allow users too disable this if they 
experience problems.

  - not sure about why some methods in HeapRegionManager have "resizing" 
in their method name. As far as I can tell, the change only allows 
concurrent uncommit. Maybe use the above "remove" for regions instead of 
"uncommit" regions.

Background: The code and comments in HeapRegionManager are aware and 
fairly consistent (I hope) to not use the wording commit/uncommit for 
operations on HeapRegions. Only operations on memory pages should use 
committed/uncommit. The naming in the added methods does not respect that.

  - some of the methods (e.g. to find free regions) should inform the 
caller that there are to-be-removed regions to maybe retry after waiting 
for completion of that thread to avoid unexpected OOM.

  - I have a feeling that if the concurrent uncommit thread worked on 
pages, not regions, the code would be easier to understand. It would 
also solve the issue you asked about with the 
G1RegionsSmallerThanCommitSizeMapper. You may still need to pass region 
numbers anyway for logging, but otoh the logging could be done 
asynchroniusly.

  - s/parallely/concurrently a few times

  - there is a bug in the synchronization of the concurrent uncommit 
thread: it seems possible that the uncommit thread is still working 
(iterating over the list of regions to uncommit) while a completed 
(young) GC may add new regions to that list as young gcs do not wait for 
completion of the uncommit thread.

  - the concurrently uncommitted regions only become available for 
commit at the next gc, which seems very long. Why not make them 
available for commit "immediately"?

Related to that is the use of par_set/clear_bit in e.g. the available 
bitmap: since all par/clear actions are asserted to be in the vm thread 
at a safepoint, there does not seem to be a need for using the parallel 
variants of set/clear bit (if keeping the current mechanism).

  - please document the supposed interactions and assumptions like in 
the above two paragraphs between the "resize" thread and the other 
threads and safepoints.

  - please use the existing HeapRegionManager::shrink_by() 
method+infrastructure for passing a shrink request to the HRM, either 
immediately shrinking the heap or deferring for later shrinking 
(probably controlled by a flag) instead of adding new methods for the 
same purpose (with mostly the same contents).

E.g. there is a lot of code duplication in the new code in 
HeapRegionManager, particularly the combination of 
HeapRegionManager::concurrent_uncommit_regions_memory, 
HRM::synchronize_uncommit_regions_memory and HRM::uncommit_regions could 
probably be cut to almost 1/3rd.

On 13.01.20 12:45, Thomas Schatzl wrote:
> Hi Liang,
> 
> On 07.01.20 17:33, Liang Mao wrote:
>> Hi Thomas,
>>
>> As we previously discussed, I use the concurrent heap uncommit/commit 
>> mechanism to implement the SoftMaxHeapSize for G1. It is also for 
>> thfurther implementation of
>> G1ElasticHeap for ergonomic
>> change of heap size. In the previous 8u implementation, we had some 
>> limitations which are all
>> removed now in this patch. The concurrent uncommit/commit can also 
>> work with some senarios for
>> immediate heap expansion.
>>
>> Here is the webrev link:
>> http://cr.openjdk.java.net/~luchsh/8236073.webrev/
>>
>> We still have some questions.
>> 1. Does the SoftMaxHeapSize limitation need to consider the GC time 
>> ratio as in
>> expand_heap_after_young_collection? Now we haven't put the logic in yet.

I am not completely clear what you are asking about, but the gc time 
ratio only affects the current "optimal" heap size which is bounded by 
SoftMaxHeapsize/max_capacity.

>> 2. The concurrent uncommit/commit can only work for 
>> G1RegionsLargerThanCommitSizeMapper but not
>> G1RegionsSmallerThanCommitSizeMapper which might need some locks to 
>> ensure the multi-thread
>> synchronization issue( heap may expand immediately). I think bringing 
>> the lock synchronization
>> may not be worthy for the little gain. Another idea is can we just not 
>> uncommit the pages of
>> auxiliary data if in G1RegionsSmallerThanCommitSizeMapper? Heap 
>> regions should not be
>> G1RegionsSmallerThanCommitSizeMapper most of time I guess...
>>
>> Looking forward to your advice:)

Thanks,
   Thomas