Status of JEP-8204088/JDK-8236073
Man Cao
manc at google.com
Wed Jun 9 17:56:31 UTC 2021
Hi Thomas,
Thanks for the feedback!
> Fwiw, in my opinion the intention of SoftMaxHeapSize has been more to
> account for external user requirements not caught by the internal gc
> load, not that gc load should guide SoftMaxHeapSize (and override it)
> directly. I.e. as an orthogonal consideration for heap sizing.
Yes. This should be the case if the user has set SoftMaxHeapSize explicitly.
We are actually considering two use cases that will be built on top of the
work on SoftMaxHeapSize and GCTimeRatio (or GCCpuRatio), and they both
relieve users from setting SoftMaxHeapSize (and/or Xmx) themselves.
1. Container RAM limit is fixed. In this case, the goal is to keep total
container usage within the limit. If usage is approaching the limit, the
JVM could observe the current CPU overhead; if the overhead is not too
high, it can automatically set a lower SoftMaxHeapSize to keep total
container usage within the limit.
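To make use case 1 concrete, here is a minimal sketch of the kind of policy
described above: shrink a soft heap limit when container usage nears the RAM
limit, but only while the observed GC CPU overhead stays tolerable. All
class/method names and thresholds are illustrative assumptions, not real JVM
APIs or tuned values.

```java
// Hypothetical sketch of use case 1 (fixed container RAM limit).
// Names and thresholds are illustrative only.
public class SoftMaxPolicy {
    static final double USAGE_HEADROOM = 0.95; // act when usage > 95% of limit
    static final double MAX_GC_CPU = 0.10;     // stop shrinking past 10% GC CPU
    static final double SHRINK_STEP = 0.90;    // shrink soft max by 10% per step

    // Returns a new soft heap limit, or the current one if shrinking
    // further would push GC CPU overhead too high.
    static long nextSoftMax(long containerUsage, long containerLimit,
                            double gcCpuOverhead, long currentSoftMax) {
        boolean nearLimit = containerUsage > USAGE_HEADROOM * containerLimit;
        boolean gcCanAbsorbMore = gcCpuOverhead < MAX_GC_CPU;
        if (nearLimit && gcCanAbsorbMore) {
            return (long) (currentSoftMax * SHRINK_STEP);
        }
        return currentSoftMax;
    }
}
```

A real implementation would run this periodically inside the JVM and feed the
result into the SoftMaxHeapSize machinery; the point is only that the decision
needs both the container usage and the GC CPU overhead as inputs.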
2. Container RAM limit can grow or shrink automatically. This is probably
unique to our production environment. The goal in this case is to make the
JVM use as much RAM as it needs, but not so much that it wastes memory.
Ideally this could be achieved by allowing Xmx=unlimited (JDK-4408373) and
then making the JVM respect GCTimeRatio or GCCpuRatio better, so it does
not grow the heap too much. In practice, this can be achieved by setting a
very large Xmx and making the JVM respect GCTimeRatio or GCCpuRatio
better. (This use case may not require SoftMaxHeapSize after all.)
> To a large degree I think that pause time has (historically) been just a
> more convenient-to-calculate (across OSes and everything) and fairly
> accurate substitute for GC cpu overhead.
In my experience with JDK 11 and G1, pause overhead can diverge
significantly from CPU overhead. I've seen cases where pause overhead is
~2% but CPU overhead is >50%, e.g., due to problems with humongous
allocations (perhaps already fixed by JDK-8245511 and JDK-8240556).
> Although I agreed above, there may be value in adding a new flag anyway:
> GCTimeRatio is fairly clumsy to use (i.e. GCCpuRatio = 1 / (1 +
> GCTimeRatio)). At least we should make it a floating point value....
Regarding whether to change the meaning of GCTimeRatio or add a new
GCCpuRatio flag, I was a bit concerned about what happens if the user has
already set some value for GCTimeRatio for G1.
I searched our repo and found fewer than 10 jobs setting GCTimeRatio, and
most of them are for non-G1 collectors. The cases setting it with G1 seem
unnecessary and can be removed.
So now I think we can make significant changes to the meaning of
GCTimeRatio for G1, as it is not that effective anyway, given all the
unresolved issues. I also agree that GCTimeRatio is clumsy to use.
How about we introduce a new flag like GCCpuPercentage, similar to
MaxRAMPercentage from JDK-8186248? Then we could make GCTimeRatio a no-op
flag for G1.
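For reference, the arithmetic behind the "clumsy" complaint quoted above can
be made explicit. Using the relation from the thread, GC CPU fraction =
1 / (1 + GCTimeRatio), a small sketch of the conversion in both directions
(GCCpuPercentage is the proposed name, not an existing flag):

```java
// Sketch of the relationship between the existing GCTimeRatio flag and a
// proposed GCCpuPercentage. Formula taken from the thread:
// gcFraction = 1 / (1 + GCTimeRatio).
public class GcRatioConversion {
    // Fraction of total time the GC is allowed to use.
    static double gcCpuFraction(int gcTimeRatio) {
        return 1.0 / (1.0 + gcTimeRatio);
    }

    // Inverse mapping: a target GC CPU percentage back to a GCTimeRatio.
    static double gcTimeRatioFor(double gcCpuPercentage) {
        return 100.0 / gcCpuPercentage - 1.0;
    }

    public static void main(String[] args) {
        // GCTimeRatio=12 (the G1 default) corresponds to ~7.7% GC time.
        System.out.printf("GCTimeRatio=12 -> %.2f%% GC CPU%n",
                100 * gcCpuFraction(12));
        // A 1% GC CPU target requires GCTimeRatio=99; specifying the
        // percentage directly would be far less clumsy.
        System.out.printf("1%% GC CPU target -> GCTimeRatio=%.0f%n",
                gcTimeRatioFor(1.0));
    }
}
```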
-Man
On Wed, Jun 9, 2021 at 2:15 AM Thomas Schatzl <thomas.schatzl at oracle.com>
wrote:
> Hi Jonathan,
>
> On 08.06.21 01:16, Jonathan Joo wrote:
> > Hi Thomas,
> >
> >
> > I took some time to read through the bugs related to GCTimeRatio.
> >
> >
> > I think GCTimeRatio *may* work for this purpose, if all of the relevant
> > open issues are addressed. Like you mentioned in your email, I was
> > indeed able to repro the fact that even when GCTimeRatio is set to
> > aggressive levels (i.e. GCTimeRatio=1), too much of the heap is still
> > allocated. So fixing the related bugs may definitely help here, and I'll
> > experiment more with your proposed fixes.
>
> Okay, thanks for giving them a try.
>
> > Furthermore, I'd like to also
> > investigate how well SoftMaxHeapSize works at keeping heap usage within
> > the limit - you mentioned in your earlier email that the heap sizing
> > issues have been addressed but I wasn't sure of the exact status of
> > that. I'll patch your changes at
> >
> https://github.com/tschatzl/jdk/tree/8238687-investigate-memory-uncommit-during-young-gc2
> to
> > get a firsthand idea.
>
> Summing it up, the current available patches are:
>
> JDK-8238687 and JDK-8253413: improves (re-)sizing policy and acts on
> that at any young gc:
>
> https://github.com/tschatzl/jdk/tree/8238687-investigate-memory-uncommit-during-young-gc2
> JDK-8248324: removes heap resizing at remark, which used a completely
> different policy anyway. Full gc is still an issue, but "it should not
> happen". Patch attached to CR.
> JDK-8236073: implements SoftMaxHeapSize, patch attached to CR.
>
> Not sure if they address all of the heap sizing issues, but in my tests
> with them, heap size is much more stable and follows GCTimeRatio more
> closely.
>
> There may be a need to reconsider the GCTimeRatio default value after
> these changes, idk.
>
> >
> >
> > However, one consideration against GCTimeRatio is that GCTimeRatio
> > relies on GC pause times, whereas ideally we can use total CPU overhead.
> > (The latter would be able to incorporate time spent by concurrent GC
> > worker threads, which may be constantly doing work in the background. As
> > far as I understand, this is not necessarily reflected in pause times.)
>
> GCTimeRatio was introduced with pure STW garbage collectors. At that
> time, getting per-process cpu measurements might have been more
> complicated too, and cpu sharing less common.
>
> So some modification of the meaning of "GCTime" for partially concurrent
> GCs may be appropriate.
>
> > Thus, I believe there are slight differences there which make CPU
> > overhead a more accurate measurement of "load" than GC pause times (at
> > least, for the use case we anticipate here at Google).
>
> >
> >
> > We already have developed some internal patches which allow us to
> > compute GC CPU overhead, so using this metric to influence
> > SoftMaxHeapSize shouldn't be too much of a problem for us. Given that we
> > have this information:
>
> Fwiw, in my opinion the intention of SoftMaxHeapSize has been more to
> account for external user requirements not caught by the internal gc
> load, not that gc load should guide SoftMaxHeapSize (and override it)
> directly. I.e. as an orthogonal consideration for heap sizing.
>
> Of course, GCTimeRatio (or GCCpuRatio) ultimately also determines some
> kind of heap size goal, and both SoftMaxHeapSize and that goal
> determined by GCTimeRatio need to be consolidated into a single value -
> after all there can be only one actual heap size in the VM :)
>
> Afaik the current thinking is that that ultimate heap size goal should
> be something like
>
> min(SoftMaxHeapSize, Goal-set-by-GCTimeRatio)
>
> > 1.
> >
> > Do you see any benefit to using pause times to determine
> > SoftMaxHeapSize rather than CPU overhead? Is one more viable than
> > the other?
>
> To a large degree I think that pause time has (historically) been just a
> more convenient-to-calculate (across OSes and everything) and fairly
> accurate substitute for GC cpu overhead.
>
> >
> > 2.
> >
> > Do you think there is value in modifying GCTimeRatio to measure CPU
> > overhead rather than pause times?
>
> This is just my opinion, but yes.
>
> >
> > 3.
> >
> > If not, would it be helpful to still introduce this functionality
> > into the JVM, perhaps as a new JVM flag like `GCCpuRatio`? (So as to
> > not collide with GCTimeRatio's existing functionality.)
>
> Although I agreed above, there may be value in adding a new flag anyway:
> GCTimeRatio is fairly clumsy to use (i.e. GCCpuRatio = 1 / (1 +
> GCTimeRatio)). At least we should make it a floating point value....
>
> Thanks,
> Thomas
>