Re: RFR: 8236073: G1: Use SoftMaxHeapSize to guide GC heuristics

Thu Feb 6 12:27:09 UTC 2020

Hi Thomas,

Thanks for the testing and evaluating!

I tried your test with specjbb2015 and got slightly different
results, possibly because of differences in machine capability. The
configuration I used is below:
-Xmx8g -Xms2g -Xlog:gc* -XX:GCTimeRatio=4
-XX:+UseStringDeduplication
-Dspecjbb.comm.connect.type=HTTP_Jetty
-Dspecjbb.controller.type=PRESET
-Dspecjbb.controller.preset.ir=5000
-Dspecjbb.controller.preset.duration=10800000

The heap was around 6GB after running for a while (300s), and
I was able to use SoftMaxHeapSize to make it shrink to 5GB. That
should be comparable to your scenario of shrinking the heap to 3GB.
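
For anyone who wants to repeat the shrink step from inside the JVM, here
is a minimal sketch. It assumes a JDK in which SoftMaxHeapSize exists as a
manageable flag; on a JVM without it, setVMOption throws
IllegalArgumentException. The class name and the parseSize helper are
made up for illustration and are not part of the patch.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class SoftMaxHeapSizeSetter {
    /** Parses "5g", "512m", "64k" or a plain byte count into bytes (hypothetical helper). */
    static long parseSize(String text) {
        String s = text.trim().toLowerCase();
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier;
        switch (s.charAt(s.length() - 1)) {
            case 'k': multiplier = 1024L; break;
            case 'm': multiplier = 1024L * 1024; break;
            case 'g': multiplier = 1024L * 1024 * 1024; break;
            default:  multiplier = 1L; break;
        }
        String digits = multiplier == 1L ? s : s.substring(0, s.length() - 1);
        return Math.multiplyExact(Long.parseLong(digits), multiplier);
    }

    /** Sets the soft limit on the running JVM; assumes the flag is manageable. */
    static void setSoftMaxHeapSize(long bytes) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.setVMOption("SoftMaxHeapSize", Long.toString(bytes));
    }

    public static void main(String[] args) {
        long bytes = parseSize(args.length > 0 ? args[0] : "5g");
        setSoftMaxHeapSize(bytes);
        System.out.println("SoftMaxHeapSize set to " + bytes + " bytes");
    }
}
```

Setting the flag only changes the goal; whether and when the heap
actually shrinks still depends on the sizing policy discussed below.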

The behavior is as I expected, but I think you may have expected a
more aggressive result. In my mind, under a constant load the JVM
might not need to shrink the heap, because it is supposed to have
already expanded the heap to the right capacity. The soft limit I
imagine is meant to bring the heap size down after a load spike. In
Alibaba's workload, heap shrinking is controlled by the cluster's
unified control center, which has the prediction data, and the soft
limit works more like a *hard* limit in our 8u implementation.
So I think it is acceptable that the heap failed to shrink
to 2GB in your test case. You can see that
G1HeapSizingPolicy::can_shrink_heap_size_to is a bit conservative,
and we may be able to make it more aggressive.

For an almost idle application that does not run a GC for a
rather long time, the shrink cannot happen. In our previous 8u
patch, we have a timer to trigger GC, and the softmx is changed by
a jcmd that also triggers a GC (there was no SoftMaxHeapSize option
in 8u yet). Shall we introduce a timer-triggered GC as well?

Honestly, I don't think Min/MaxHeapFreeRatio is a good way to determine
heap expansion/shrinking in G1. In our practical 8u experience we never
have a full GC, so Min/MaxHeapFreeRatio is useless. When I reproduced
your test, the only exception was that the heap expanded to 6GB again
after shrinking to SoftMaxHeapSize=5g, because we resize the heap in remark.
BTW, I don't think remark is a good point to resize the heap, since in the
remark phase regions full of garbage haven't been reclaimed yet. IMHO we don't
even need to resize in remark; we could just resize after mixed GC according
to GCTimeRatio.

Your change to make the adaptive IHOP control take SoftMaxHeapSize into
account seems similar to the approach in ZGC. ZGC is a single-generation
GC whose scenario is much simpler. Maybe we don't need SoftMaxHeapSize to
guide GC decisions in G1. Since we already have a policy that decides
whether to shrink the heap according to SoftMaxHeapSize, I'm not sure we
need to make adaptive IHOP depend on SoftMaxHeapSize as well... We may end
up in a situation where the heap cannot be shrunk to SoftMaxHeapSize, yet
concurrent marking becomes more frequent because the IHOP policy has been
affected.

> In the log I have, the problem seems to be that we are re-setting the 
> softmaxheapsize within the space reclamation phase (i.e. mixed gc) and 
> G1 sizing policies got confused, i.e. it partially keeps on using the 2g 
> goal for young gen sizing until the *2 problem expands it. That's a bug 
> and needs to be fixed.

I don't think it's a problem: after a mixed GC,
resize_heap_after_young_collection will evaluate whether the heap can be
shrunk to the new value of SoftMaxHeapSize.

Thanks,
Liang

------------------------------------------------------------------
From:Thomas Schatzl <thomas.schatzl at oracle.com>
Send Time:2020 Feb. 5 (Wed.) 16:14
To:"MAO, Liang" <maoliang.ml at alibaba-inc.com>; hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>
Subject:Re: RFR: 8236073: G1: Use SoftMaxHeapSize to guide GC heuristics

Hi Liang,

   apologies for the late reply - I did look at the patch immediately
after you posted it, but initial tests showed that it does not work as
(I) expected. More about that below. So I went ahead and hacked up
something that comes closer to what I had in mind. Unfortunately other
more urgent issues came up, which caused the delay on this work. Sorry.
(And sorry for the long post.)

Not having any kind of workload to work with for testing the change, I
used some configuration of specjbb2015 with fixed ir [0] (taken from a
colleague's unrelated recent internal test), simulating a constant load
whose heap usage the user wants to control.

In this situation I want to apologize for using specjbb2015 in this public
reply because it is not openly available, but I only noticed when writing
up this email. Finding a substitute and redoing the measurements would
probably take more time. I will start looking into this issue.

Anyway, in my test scenario, after warmup, the user tries to first limit 
the heap to 2GB, and after a while to 3GB, and then back to 8GB.

The resulting graph [1] shows heap metrics over time: blue ("soft") is 
the current SoftMaxHeapSize, pink ("committed") represents committed 
memory, yellow ("goal")  shows G1's current heap size goal, turquoise 
("free") the amount of free heap and purple ("used") the amount of used 
memory.

Ignoring the drop from ~second 30-100 where I finally managed to set
Min/MaxHeapFreeRatio ;) you can see that G1 kind of stabilizes at
around 3.8GB heap; at ~second 410 SoftMaxHeapSize ("soft") is set to
2GB. As you can see, G1 ignores the request. This corresponds to the
code where apparently the heap is only reduced to SoftMaxHeapSize if
there is enough free space to reduce to that value (I think).

At ~second 620 I set SoftMaxHeapSize to 3GB, which gives the expected
drop in memory usage. However, since the change does not modify G1's goals,
it ultimately just ignores the SoftMaxHeapSize goal. It would probably work
if there were no further application activity.

I created a webrev [2] of an alternative attempt that modifies G1's
goal/target heap size in the adaptive IHOP mechanism so that G1
automatically starts marking early enough for a space reclamation phase
to start before the heap reaches SoftMaxHeapSize. It basically makes the
predictor's reserve, relative to the current committed heap size, depend
not only on G1ReservePercent but also on the specified SoftMaxHeapSize.

One complication in a generational setting is to adapt young gen 
(particularly survivor size) to that goal too, but I think the change 
does okay with that.

However it is not finished yet, there is debugging code in it and one 
FIXME that is about shuffling around code properly.

In the graph at [3] you can see the results, with same metrics shown. In 
this case G1 fairly well follows the soft goal.

For the 2g softmaxheapsize goal it works perfectly in the example (*1), 
in the 3g softmaxheapsize change we get some initial short overshoot in 
committed memory. (*2/*3)

There are however some problems/differences to your solution here which 
need to be discussed a bit more to see if it fits you and ultimately 
make it perform better:

*0 this change uses existing sizing to uncommit memory, i.e. memory is
not uncommitted immediately but as part of regular operation. This means
that the garbage collection cycle needs to advance. In case of specjbb
with fixed IR this is no issue, but completely quiescent applications
need other mechanisms, like having the "Promptly Return Unused Committed
Memory" (JEP 346) feature enabled. Some tuning is needed in that mechanism
for almost-idle applications.

*1 the problem with only setting SoftMaxHeapSize and relying on the
regular uncommit mechanism is that due to other reasons, e.g.
GCTimeRatio, G1 won't achieve this kind of compact heap. This is the
reason why my setup includes GCTimeRatio=4 on the command line -
otherwise G1 would not achieve the 2g goal in either case (it would settle
around 3g with my changes, didn't test the original changes; max heap
usage would be ~5.8GB without SoftMaxHeapSize fyi), and you can't modify
it during runtime (i.e. when you want to select a different
throughput/latency tradeoff to achieve lower heap usage).

*2 looking more closely at the (first) overshoot for the 3g soft max
heap size goal, I think this is a remaining issue in the heap
sizing policy in conjunction with soft max heap size, i.e. temporarily
the target gctimeratio is set to 10% for various reasons (in
G1HeapSizingPolicy::expansion_amount()).

*3 In the log I have, the problem seems to be that we are re-setting the
softmaxheapsize within the space reclamation phase (i.e. mixed gc) and
G1 sizing policies got confused, i.e. it partially keeps on using the 2g
goal for young gen sizing until the *2 problem expands it. That's a bug
and needs to be fixed.

So far the text above only looked at the best case where everything fits
together; there are some other issues that I noticed during my testing
which will prevent you from achieving a tight heap in some cases.
Something to think about.

*4 GCTimeRatio/heap expansion during young gc has different goals than
the (un-)commit at the end of full gc. In some cases, with
SoftMaxHeapSize (but also without), the latter will undo the expansion
done at young gc, which will immediately start to expand again.

*5 GCTimeRatio can't be adjusted during runtime, which means that you
won't achieve as tight a heap as in this example. GCTimeRatio is
also a bit unwieldy to use, i.e. since it is the denominator in the
(default; nobody sets GCPauseIntervalMillis) time calculation, you get
"good" granularity of low values, but pretty bad granularity of high values.
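
To make the granularity point concrete, here is a minimal sketch. It
assumes the commonly documented relation that GCTimeRatio=N targets a GC
share of 1/(1+N) of total time. The class and method names are made up
for illustration; this is not G1 code.

```java
public class GcTimeRatioTable {
    /** Target share of total time spent in GC, in percent: 100 / (1 + GCTimeRatio). */
    static double gcTimePercent(int gcTimeRatio) {
        if (gcTimeRatio < 0) {
            throw new IllegalArgumentException("GCTimeRatio must be >= 0");
        }
        return 100.0 / (1.0 + gcTimeRatio);
    }

    public static void main(String[] args) {
        // Print the GC time target for a spread of small and large ratios.
        int[] ratios = {1, 2, 3, 4, 9, 19, 49, 98, 99};
        for (int ratio : ratios) {
            System.out.printf("GCTimeRatio=%d -> %.2f%% GC time%n",
                              ratio, gcTimePercent(ratio));
        }
    }
}
```

Under that assumption GCTimeRatio=4 corresponds to a 20% GC time target
and GCTimeRatio=9 to 10%. Because the flag sits in the denominator, one
step at the low end moves the target by many percentage points, while one
step near 99 barely moves it.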

*6 Min/MaxHeapFreeRatio default values are probably too high - with
adaptive IHOP, G1 can typically meet its current goal very well, so any
excess is often just wasted committed memory. A related issue: don't set
Min/MaxHeapFreeRatio to something below G1ReservePercent,
i.e. the default reserve for the IHOP. In this case there will be
significant memory commit/uncommit pauses.
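
As a rough illustration of how much headroom these two flags allow, the
sketch below computes the capacity range in which the free share of the
heap stays between MinHeapFreeRatio and MaxHeapFreeRatio for a given
amount of used memory. It assumes the usual reading of the flags (free
memory as a percentage of capacity) and is not the actual G1 resizing
code; the names are hypothetical, and a ratio of 100 is deliberately not
handled.

```java
public class HeapFreeRatioBounds {
    /** Smallest capacity at which at least freeRatioPercent of the heap is free. */
    static long capacityForFreeRatio(long usedBytes, int freeRatioPercent) {
        if (usedBytes < 0 || freeRatioPercent < 0 || freeRatioPercent >= 100) {
            throw new IllegalArgumentException("need usedBytes >= 0 and 0 <= ratio < 100");
        }
        long usedPercent = 100L - freeRatioPercent;
        // capacity = ceil(used * 100 / (100 - ratio)), in integer arithmetic
        return (Math.multiplyExact(usedBytes, 100L) + usedPercent - 1) / usedPercent;
    }

    public static void main(String[] args) {
        long used = 1536L * 1024 * 1024;   // 1.5GB in use, chosen for illustration
        int minFree = 40;                  // assumed MinHeapFreeRatio
        int maxFree = 70;                  // assumed MaxHeapFreeRatio
        System.out.println("lower capacity bound: " + capacityForFreeRatio(used, minFree));
        System.out.println("upper capacity bound: " + capacityForFreeRatio(used, maxFree));
    }
}
```

With 1.5GB in use and ratios of 40 and 70, the range works out to 2.5GB
through 5GB of capacity, which shows how much committed memory high
ratios can leave unused.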

Here is my question to you (and any readers): are you using
Min/MaxHeapFreeRatio? Using SoftMaxHeapSize to set a target heap size
seems to be much more direct and better than Min/MaxHeapFreeRatio. Given
the above (and assuming that there are no reasons to keep it), it may be
useful to start the deprecation process (at least for the use in G1) when
SoftMaxHeapSize is in.

There are some more issues with heap sizing not really relevant to this 
discussion, I need to think about them a bit more and file appropriately 
worded CRs.

Either way, what do you think about my suggested change? Can you try it 
on your workloads to see if it could do the job? Any other comments?

More work is needed on this patch I think; also we might need to think 
about how the user can detect this change of the target better in the 
logs for troubleshooting.

The original patch (webrev.2) also contained some minor unrelated 
cleanups (one constification of a method, one rename of the heap 
resizing phase) that might be easier to address separately more quickly ;)

Thanks,
   Thomas

[0] specjbb2015 settings: -Dspecjbb.comm.connect.type=HTTP_Jetty
-Dspecjbb.controller.type=PRESET -Dspecjbb.controller.preset.ir=5000
-Dspecjbb.controller.preset.duration=10800000
VM settings: -Xms2g -Xmx8g -XX:GCTimeRatio=4 -XX:+UseStringDeduplication

This gives ~1.5GB live set size, on my machine around 10-40ms pause 
time, so rather light load at least without setting any heap size goal; 
in my runs, G1 settles to around 3.8GB of committed heap. (with 
Min/MaxHeapFreeRatio=10 set after startup, but you can just put it into 
the VM startup options too)

[1] http://cr.openjdk.java.net/~tschatzl/8236073/softmaxheapsize-alibaba.png

[2] http://cr.openjdk.java.net/~tschatzl/8236073/webrev/

[3] http://cr.openjdk.java.net/~tschatzl/8236073/softmaxheapsize.png