RFR (M): 8212657: Implementation of JDK-8204089 Timely Reduce Unused Committed Memory

Mon Nov 19 13:14:20 UTC 2018

Hi Man,

On Sat, 2018-11-17 at 00:25 -0800, Man Cao wrote:
> Hi,
> 
> Thanks for the response! We discussed this internally and our
> consensus is that we will gather more performance data from
> production workload and compare the tradeoff and effectiveness of the
> two approaches:
> (a) this JEP: setting Xms != Xmx and periodically triggering GC;
> (b) our local features: setting Xms = Xmx, and calling MADV_DONTNEED
> on some free regions, and use mutator utilization to trigger
> additional GCs.

Looking forward to hearing back from you about this. Note that barring
unforeseen issues I would like to go ahead with this change for JDK12
as is.

We can always tweak details later.

[..]

> Other responses and context below.
> 
> > [...]
> 
> > There will be a lot of resistance to making -Xmx==-Xms behave as you
> > suggest (in the default case...), and it seems that the problem in
> > your case is improper heuristics for -Xms in some (many?) cases
> > which seems to be acknowledged above.
> 
> If the GC ever calls MADV_DONTNEED for Xms = Xmx, it will be guarded
> by a new flag. This flag should be turned off by default. In our
> local feature, the flag is called DeallocateHeapPages. I suspect this
> would require another JEP.

I suggest we talk about whether it needs a JEP when we are officially
discussing that feature.

JEPs have a rather high administrative overhead, so (imho) we should
keep them to "significant" features that you think are useful for a
"large" number of users and you really want everyone to notice the
change. ;)

> > I am still not sure what the problem is with -Xms != -Xmx, or what
> > -Xmx==-Xms with following uncommit solves. It is hard to believe
> > for me that setting -Xms to -Xmx is easiest for an end user - I
> > would consider not setting -Xms easiest...
> > Maybe doing so improves startup time where often it is advantageous
> > to have a large eden to get the long-lived working set into old gen
> > quickly? Maybe some "startup boost" for heap sizing/some options
> > would help here much better?
> 
> Almost all GC tuning guidelines for server applications recommend
> setting Xms = Xmx.
> For example:
> https://docs.oracle.com/en/java/javase/11/gctuning/factors-affecting-garbage-collection-performance.html#GUID-B0BFEFCB-F045-4105-BFA4-C97DE81DAC5B
> https://docs.oracle.com/middleware/12213/wls/PERFM/jvm_tuning.htm#PERFM160
> Thus most production services have set them to the same value.

Thank you for giving some documentation examples, although I am not
sure whether my conclusions when reading these would be the same.

The first link merely states that setting -Xms==-Xmx increases
predictability and lists some drawbacks.
(Also, that part of the tuning guide is really old; we should fix it,
and while doing that we should try to improve the wording.)

The text from the second link seems to be from the JDK6 era, mentioning
the Sun JVM JDK 1.6, for a particular product (WebLogic) that is out of
our direct control.

I am aware that there are a lot of magic incantations from generations
of users for optimal GC settings out there, however the official
current recommendation when using G1 is:

"The general recommendation is to use G1 with its default settings,
eventually giving it a different pause-time goal and setting a maximum
Java heap size by using -Xmx if desired. "

from 
https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-garbage-collector-tuning.html#GUID-0BB3B742-A985-4D5E-A9C5-433A127FE0F6

and our official recommendation when moving from CMS or other
collectors:

"Generally, when moving to G1 from other collectors, particularly the
Concurrent Mark Sweep collector, start by removing all options that
affect garbage collection, and only set the pause-time goal and overall
heap size by using -Xmx and optionally -Xms. "

https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-garbage-collector-tuning.html#GUID-E26056D1-02A5-4367-94EF-72C66D314AF7

A lot of options used for CMS (which really is the odd one) are either
meaningless or actively harmful for G1.

> From experience with CMS, setting a smaller Xms would increase
> startup time and GC overhead after startup.

That's why I was mentioning that special-handling startup wrt. heap
sizing might be an interesting idea.

> CMS could shrink and re-expand the heap over and over, causing
> unnecessary GC overhead.

The problem I see is that CMS would also shrink the young gen
aggressively as it is a fixed fraction of the currently available heap,
causing (much) more frequent GCs. G1 does not have that problem as much
as other collectors as the young gen is sized (within limits) according
to pause time only.
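
To illustrate with made-up numbers (a sketch only, not either collector's
actual sizing code): if the young gen is a fixed fraction of the committed
heap, halving the heap halves the young gen and doubles the number of young
GCs for the same allocation rate. The fraction, heap sizes and allocation
rate below are assumptions, and the whole young gen is treated as eden for
simplicity.

```java
// Illustration only: GC frequency when young gen is a fixed fraction of the heap.
// All numbers (fraction, heap sizes, allocation rate) are hypothetical.
final class YoungGenSizing {
    /** Young gen size in MB when it is a fixed fraction of the committed heap. */
    static double fixedFractionYoungMb(double committedHeapMb, double youngFraction) {
        return committedHeapMb * youngFraction;
    }

    /** Young collections per second, assuming one GC each time the young gen fills up. */
    static double youngGcsPerSecond(double allocRateMbPerSec, double youngMb) {
        return allocRateMbPerSec / youngMb;
    }

    public static void main(String[] args) {
        double allocRate = 200.0;     // MB/s, assumed
        double youngFraction = 0.25;  // assumed fixed young/heap ratio
        for (double heapMb : new double[] {8192, 4096, 2048}) {
            double youngMb = fixedFractionYoungMb(heapMb, youngFraction);
            System.out.printf("heap=%.0f MB young=%.0f MB -> %.3f young GCs/s%n",
                    heapMb, youngMb, youngGcsPerSecond(allocRate, youngMb));
        }
    }
}
```

With a pause-time driven young gen the second factor does not depend on the
committed heap size in this way, which is the point made above.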

In addition to that, MADV_DONTNEED as part of regular heap sizing would
probably help, as it makes frequent shrinking and expanding cheap (if
the machine has enough memory).

> Basically the extra memory saving hardly justifies the extra GC
> overhead.
> The DeallocateHeapPages feature strikes a better balance between
> memory saving and overhead for reusing pages marked as MADV_DONTNEED.

> Perhaps the situation in G1 is better than in CMS, that Xms != Xmx
> does not cost much more GC overhead?

See above; also G1 sizes (increases) the heap more aggressively than
other collectors. Also there has been some heap sizing re-tuning during
the JDK9/10 timeframe, so maybe that helps too.

> Probably we'd need JDK-6490394 backported to JDK11 to have more
> memory saving for production services running JDK11?

Without JDK-6490394, memory savings are most often very small, as G1
only shrinks the heap during full GC, so yes, I would think that
without JDK-6490394 this change is not very useful.
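
If you want to check how much there is to gain in your setup, here is a
minimal sketch for watching committed vs. used heap from inside the
application. It uses plain java.lang.management, nothing G1 specific; the
class name and the sampling interval are made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch: print used vs. committed Java heap a few times.
final class CommittedHeapWatcher {
    /** Committed-but-unused heap in bytes: an upper bound on what uncommit could return. */
    static long unusedCommittedBytes(MemoryUsage usage) {
        return usage.getCommitted() - usage.getUsed();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("used=%d MB committed=%d MB unused-committed=%d MB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20,
                    unusedCommittedBytes(heap) >> 20);
            Thread.sleep(1000); // hypothetical sampling interval
        }
    }
}
```

If the committed value never drops while the application is idle, the heap
is not being shrunk in that configuration.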

> If Xms != Xmx and this JEP addresses the memory saving and GC
> overhead balance, we are happy to advise users set a smaller Xms or
> not set it at all for G1, and deprecate DeallocateHeapPages.
> 
> > That is a very non-standard way of defining mutator utilization,
> > but some of the terms are not clearly defined :)
> > From what I understand, the formula in the end just reduces to
> > periodic old gen collections regardless of other activity (e.g. it
> > does not take minor gc into account apparently).
> > ...
> 
> Discussion for mutator utilization will probably get initiated in a
> separate thread after we collect more production performance data.
> Wessam will be a better contact for mutator utilization.
> 
> > > It is orthogonal to G1UseAdaptiveIHOP to control when to start a
> > > concurrent cycle. We also found it is useful to reduce GC cost in
> > > production workload by setting a higher minimum bound to prevent
> > > concurrent cycles.
> > 
> > I did not get that paragraph, you need to explain this in more
> > detail :)
> 
> Mutator utilization considers frequency of concurrent collections,
> rather than heap occupancy.
> The second sentence is basically a case for this previous sentence:
> "If mutator utilization is too low (e.g., <40%), it can be used to
> prevent concurrent collection from happening."
> Concurrent collections could be too frequent or wasteful, for example
> JDK-8163579, and mutator utilization can prevent such cases.
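
For comparison, what I would call the usual definition of mutator
utilization is the fraction of wall-clock time in some window that is not
spent in GC. A rough sketch of that follows; it is not the definition used
in your local feature, the class and method names are made up, and the
collection times reported by the GarbageCollectorMXBeans are approximate
(what exactly they cover depends on the collector).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: mutator utilization since VM start = 1 - (GC time / elapsed time).
final class MutatorUtilization {
    /** Fraction of elapsed time not spent in GC, clamped to [0, 1]. */
    static double utilization(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 1.0; // no time has passed, so nothing was lost to GC
        }
        double u = 1.0 - (double) gcMillis / elapsedMillis;
        return Math.max(0.0, Math.min(1.0, u));
    }

    /** Sum of the accumulated collection times reported by all GC beans. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("mutator utilization since start: %.3f%n",
                utilization(totalGcMillis(), uptime));
    }
}
```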

With adaptive IHOP G1 should not start too many excess concurrent
cycles (there is some additional slack though, which can be configured
iirc), as it (in the steady state) by definition tries to find the
"last" possible point in time to start the marking.

The slack of current adaptive IHOP is rather on the conservative side
(high default slack), and there are sometimes significant throughput
gains to be made with manual IHOP tuning. Something that could be
improved directly in the heuristics :)
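
Very roughly, and glossing over a lot of detail, the idea behind adaptive
IHOP can be sketched like this. This is not the HotSpot code; all names and
numbers are made up, and the predictions are plain inputs here instead of
coming from measurements.

```java
// Sketch of the adaptive IHOP idea: start marking as late as possible so that
// it still finishes before occupancy reaches the target minus some slack.
final class AdaptiveIhopSketch {
    /**
     * Occupancy (bytes) at which to start concurrent marking.
     * targetBytes: occupancy that must not be exceeded while marking runs.
     * slackBytes: additional safety buffer to keep free.
     */
    static long markingStartThreshold(long targetBytes, long slackBytes,
                                      double predictedAllocBytesPerSec,
                                      double predictedMarkingSecs) {
        long neededDuringMarking = (long) (predictedAllocBytesPerSec * predictedMarkingSecs);
        long threshold = targetBytes - slackBytes - neededDuringMarking;
        return Math.max(0L, threshold); // never return a negative occupancy
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical: 4096 MB target, 400 MB slack, 50 MB/s promoted, 4 s marking.
        long t = markingStartThreshold(4096 * mb, 400 * mb, 50.0 * mb, 4.0);
        System.out.println("start marking at " + (t / mb) + " MB");
    }
}
```

With these made-up inputs, halving the slack moves the start of marking
200 MB later, which is where the throughput gains would come from.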

Otherwise, if the algorithm starts a significant number of excess
concurrent cycles that in particular do not accomplish anything, then we
are most likely looking at an unusual situation.

In that case, again I would prefer first looking into improving the
heuristic instead of adding yet another loosely connected
mechanism (that can fail too, needs manual tuning for particular
applications, etc).

Which is btw the case in JDK-8163579: the adaptive IHOP never gets into
steady state, as it can never achieve enough valid measurements to do
so. Which means it tries to repeat these measurements (to my taste) too
often.

A simple, existing possibility to fix this is by increasing the initial
IHOP value manually :) The suggestion presented in the CR could detect
this situation and act accordingly without introducing another tunable
(i.e. in addition to the "slack" tunable for adaptive IHOP and the
mentioned initial IHOP starting value; or even disabling adaptive
IHOP).

The text may also sound a bit alarming; in reality the throughput
difference is low. The main problem that led to finding this one
(linked) was a bad interaction between a (by now fixed) compiler
optimization and an increased frequency of marking cycles. The actual
decrease in throughput due to this behavior is not that large.

Also these observations were from an out-of-the-box run, i.e. no
options at all. Depending on your machine and other VM settings the
situation may not occur at all.

(I would need to re-test with an internal patch I (think I still) have
to see what it actually amounts to. It should be in the very low
single-digit range.)

Otherwise it would have been of higher priority and most likely already
been fixed. If you want to take a stab at it, I could try to dig out my
patch and provide it to you.

Thanks,
  Thomas