RFR: bug: Timely Reducing Unused Committed Memory

Tue Sep 25 14:48:55 UTC 2018

Thanks Ruslan for your input,

On 2018-09-21 15:35, Ruslan Synytsky wrote:
> Dear Stefan and Rodrigo, thank you for moving this forward.
> 
> ---------- Forwarded message ---------
>> From: *Stefan Johansson* <stefan.johansson at oracle.com>
>> Date: Wednesday, 19/09/2018 at 10:45
>> Subject: Re: RFR: bug: Timely Reducing Unused Committed Memory
>> To: <hotspot-gc-dev at openjdk.java.net>, <rbruno at gsd.inesc-id.pt>
>>
>>
>> Hi Rodrigo,
>>
>> I pasted your reply here to keep the discussion in one thread.
>>
>> >> I understand that it is hard to define what is idle. However, if we require the
>> >> user to provide one, I guess that most regular users that suffer from the problem
>> >> that this patch is trying to solve will simply not do it because it requires knowledge
>> >> and effort. If we provide an idle check that we think will benefit most users, then
>> >> we are probably helping a lot of users. Those for whom the default idle check is
>> >> not good enough can always disable this idle check and implement the idle
>> >> check logic in an external tool.
>> >>
>> > I agree, if we can find a solution that benefits most users, we should
>> > do it. And this is why I would like to hear from more users if this
>> > would benefit their use cases. 
> I believe the default idle definition should be based on the major 
> bottlenecks: RAM, CPU and IO loads as well as the network. RAM is what we 
> are trying to improve. IO: I’m not sure if we can measure IO load properly 
> inside the JVM. If possible, then it's good to add too. If not, then we can 
> skip it for now, as it can be measured and triggered by outside logic. 
> Network is not involved in the GC process, correct? So no need for that. 
> CPU looks the most obvious and is already implemented, so it seems like a 
> good option to start from.

I agree that CPU can look obvious, but making decisions in the VM based 
on the system load might be hard. For example, the average load might be 
low while the current process is fairly active. Another question: when 
running in the cloud, what load is the user expecting us to compare 
against, the overall system or the local container? I'm actually not 
entirely sure what the getloadavg() call returns when running in a 
container.
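
To make the concern concrete, here is a minimal Java sketch (not part of 
the patch) that reads both the system-wide load average and this 
process's own CPU load through the standard management beans. The class 
name LoadProbe, the method misleadingIdle and the two thresholds are 
illustrative assumptions. What either value reports inside a container 
depends on the JDK version and platform, so the sketch leaves that open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of the proposed patch.
class LoadProbe {

    // True when the system-wide average looks idle while this process is busy,
    // i.e. the case where a load-average-only idle check would be misleading.
    // Both thresholds are illustrative assumptions.
    static boolean misleadingIdle(double systemLoadAvg, int cpus, double processCpuLoad,
                                  double idlePerCpuLoad, double busyProcessLoad) {
        if (systemLoadAvg < 0 || processCpuLoad < 0 || cpus <= 0) {
            return false; // at least one value is not available on this platform
        }
        return (systemLoadAvg / cpus) < idlePerCpuLoad && processCpuLoad > busyProcessLoad;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double loadAvg = os.getSystemLoadAverage(); // last-minute average, negative if unavailable
        int cpus = os.getAvailableProcessors();

        double processLoad = -1.0;
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            // Recent CPU usage of this JVM process in [0.0, 1.0], negative if unavailable.
            processLoad = ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }

        System.out.printf("system load avg=%.2f cpus=%d process cpu load=%.2f%n",
                loadAvg, cpus, processLoad);
        System.out.println("misleading idle: "
                + misleadingIdle(loadAvg, cpus, processLoad, 0.1, 0.5));
    }
}
```

Running this on the host and again inside a container would show 
which load each call actually reflects.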

> 
>> > Another thing that I don't fully
>> > understand is why the flags are manageable if there isn't supposed to be
>> > some external logic that sets them?
> Some advanced users, for example cloud platform or software vendors, 
> will be able to apply additional logic based on their custom needs / 
> specifics. Such flexibility enables more use cases and it helps to 
> collect more feedback for further improvements to the defaults.

That's how I would expect it to be used as well, thanks for clarifying 
your viewpoint.
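
For illustration, external logic could change a manageable flag at 
runtime through HotSpotDiagnosticMXBean, roughly as in the sketch below. 
ManageableFlagSketch and trySet are hypothetical names. GCFrequency is 
taken from the proposed patch only, so on a JVM without the patch the 
call is expected to be rejected rather than applied.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper showing how a tool could set a manageable VM flag.
class ManageableFlagSketch {

    // Tries to set a VM option at runtime; returns false if the bean is missing
    // or the VM rejects the option (unknown flag, not manageable, bad value).
    static boolean trySet(String name, String value) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return false; // not a HotSpot-style VM
        }
        try {
            bean.setVMOption(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "GCFrequency" exists only in the proposed patch (assumption for this sketch).
        System.out.println("GCFrequency set: " + trySet("GCFrequency", "0"));
    }
}
```

The same call can be made remotely over JMX, which is how a cloud 
platform could apply its own idle policy without restarting the VM.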

>>
>> >> We can also change the semantics of "idleness". Currently it checks the load.
>> >> I think that checking the allocation rate might be another good option (instead of
>> >> load). The only corner case is an application that does not allocate but consumes
>> >> a lot of CPU. For this case, we might only trigger compaction at most once because,
>> >> as it does not allocate memory, we will not get over committed memory (i.e., the other
>> >> checks will prevent it). The opposite is also possible (almost idle application that allocates
>> >> a lot of memory) but in this scenario I don't think we want to trigger an idle compaction.
>> >>
>>
>> > This is my main problem when it comes to determining "idleness": for some
>> > applications allocation rate will be the correct metric, for others it
>> > will be the load and for a third something different. It feels like it
>> > is always possible to come up with a case that needs something different.
> I would prefer to start with the most obvious one, based on CPU, let 
> more people try it by promoting the fact that the JVM is elastic now, 
> and then collect feedback that can be turned into additional logic 
> later.
> 
So basically, the first version would have two flags, one to turn on 
periodic GCs (currently named GCFrequency) and one to control at which 
average load (MaxLoadGC) these GCs will kick in?
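
A minimal sketch of how such a two-flag check could look, written in 
Java for readability rather than as the actual VM code. The class name 
PeriodicGcPolicy is hypothetical, and the semantics are assumptions: 
GCFrequency is treated as an interval in milliseconds where 0 disables 
periodic GCs, and MaxLoadGC as an upper bound on the load average.

```java
// Hypothetical policy sketch; flag semantics are assumptions, not the patch.
class PeriodicGcPolicy {

    // gcFrequencyMs stands in for GCFrequency (0 or less disables periodic GCs).
    // maxLoadGc stands in for MaxLoadGC (highest load average at which a GC may run).
    static boolean shouldRun(long gcFrequencyMs, double maxLoadGc,
                             long msSinceLastPeriodicGc, double loadAvg) {
        if (gcFrequencyMs <= 0) {
            return false; // periodic GCs are turned off
        }
        if (msSinceLastPeriodicGc < gcFrequencyMs) {
            return false; // the interval has not elapsed yet
        }
        if (loadAvg < 0) {
            return false; // load unavailable: stay conservative (assumption)
        }
        return loadAvg <= maxLoadGc; // only collect while the system looks idle
    }

    public static void main(String[] args) {
        // Interval elapsed and load below the threshold.
        System.out.println(shouldRun(60_000, 0.5, 75_000, 0.2));
        // Interval elapsed but the system is busy.
        System.out.println(shouldRun(60_000, 0.5, 75_000, 1.8));
    }
}
```

Keeping both values as plain inputs means an external tool could 
change them at runtime if the flags are manageable.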

>> >> Having said that, I am open to changing this flag or even removing it as it is one of the
>> >> hardest to get right.
>> >>
>>
>> > As I said before, to me it feels like just having a periodic GC interval
>> > flag that is manageable would be a good start. Maybe have a constraint
>> > that the periodic GC only occurs if no other GCs have happened during
>> > the interval.
>>
> A decision based on the previous GC cycles is a very good proposal. I 
> think we need to take it into account somehow, but I'm not so deep into 
> it. Input from others will be helpful here.

There are probably corner cases in this area as well, but I guess the 
simple constraint I described might be a good start. But as you say, 
input from others would be very helpful.
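
As a rough illustration of that constraint, the sketch below compares 
the total GC count at the start and end of each interval, using the 
standard GarbageCollectorMXBean counters as a stand-in for what the VM 
would track internally. IntervalGcTracker and its methods are 
hypothetical names, not part of the patch.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical tracker: allows a periodic GC only after an interval with no other GC.
class IntervalGcTracker {
    private long countAtIntervalStart;

    IntervalGcTracker(long initialCount) {
        this.countAtIntervalStart = initialCount;
    }

    // Call once at the end of each interval with the current total GC count.
    // Returns true if no GC ran during the interval, then starts a new interval.
    boolean intervalWasQuiet(long currentCount) {
        boolean quiet = currentCount == countAtIntervalStart;
        countAtIntervalStart = currentCount;
        return quiet;
    }

    // Sums the collection counts of all collectors in this JVM.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        IntervalGcTracker tracker = new IntervalGcTracker(totalCollections());
        System.gc(); // likely bumps the counter, unless explicit GCs are disabled
        System.out.println("quiet interval: " + tracker.intervalWasQuiet(totalCollections()));
    }
}
```

Note that a periodic GC triggered by this check increases the count 
itself, so the interval right after it would be reported as not quiet 
unless the caller resets the tracker.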

>> > Could you explain how your use case would suffer from such
>> > limitations?
> In my opinion, CPU load spikes are clearly one of the major use cases 
> eligible for defaults.

This is a clear and good use case where I guess having a load threshold 
should really help.

Thanks,
Stefan

> 
> Thank you
> 
>>
>> > Thanks,
>> > Stefan
>>
>> >> cheers,
>> >> rodrigo
>>
>>
>> On 2018-09-13 14:30, Stefan Johansson wrote:
>> > Hi Rodrigo,
>> >
>> > Sorry for being a bit late into the discussion. We've had some internal
>> > discussions and realized that there are some questions that I need to
>> > bring up here.
>> >
>> > I'm trying to better understand under what circumstances this feature is
>> > to be used and how a user should use the different flags to tweak it to
>> > their use case. To me it feels like GCFrequency would be enough to make
>> > sure that the VM returns memory on a timely basis. And if the flag is
>> > managed, it can be controlled to not do periodic GCs during high load.
>> > With that we get a good way to periodically try to reduce the committed
>> > heap.
>> >
>> > The reason I ask is because I have a hard time seeing how we can
>> > implement a generic policy for when the system is idle. A policy that
>> > will apply well to most use cases. For some cases having the flags you
>> > propose might be good, but for others there might be a different set of
>> > options needed. If this is the case then maybe the logic and policy of
>> > when to do this can live outside the VM, while the code to periodically
>> > do GCs lives within the VM. What do you think about that? I understand
>> > the problems you've stated with having the policy outside the VM, but
>> > at least we have more information to act on there.
>> >
>> > We know that many have asked for features similar to this one and it
>> > would be nice to get input from others on this to make sure we implement
>> > something that benefits the whole user base as much as possible. So
>> > anyone with a use case that could benefit from this, please chime in.
>> >
>> > Regards,
>> > Stefan
>> >
>> >
>> >
>> > On 2018-09-07 17:37, Rodrigo Bruno wrote:
>> >> Hi Per and Thomas,
>> >>
>> >> thank you for your comments.
>> >>
>> >> I think it is possible to implement this feature using the service
>> >> thread or using a separate thread.
>> >> I see some pros and cons of having a separate thread:
>> >>
>> >> Pros:
>> >> - using the service thread exposes something that is G1 specific to
>> >> the rest of the JVM.
>> >> Thus, using a separate thread hides this feature from the outside.
>> >>
>> >> Cons:
>> >> - Having a manageable timeout is a bit more tricky to implement in a
>> >> separate/dedicated thread.
>> >> We need to be able to handle switching it on and off. It might require some
>> >> variable polling.
>> >> - It requires some more memory.
>> >>
>> >> Regardless of the path taken, I can prepare a new version of the patch
>> >> whenever we decide on this.
>> >>
>> >> cheers,
>> >> rodrigo
>> >>
>> >> Per Liden <per.liden at oracle.com>
>> >> wrote on Friday, 7/09/2018 at 11:58:
>> >>
>> >>     Hi Thomas,
>> >>
>> >>     On 09/07/2018 10:10 AM, Thomas Schatzl wrote:
>> >>     [...]
>> >>      >    overnight I thought a bit about the implementation, and given the
>> >>      > problem with heap usage of the new thread, and the requirement of being
>> >>      > able to turn on/off that feature by a managed variable, the best change
>> >>      > would probably be reusing the service thread as you did in the initial
>> >>      > change.
>> >>
>> >>     I'm not convinced that this should be handled outside of G1. If there's
>> >>     a need to have the flag manageable at runtime (is that really the
>> >>     case?), you could just always start the G1DetectIdleThread and have it
>> >>     check the flag. I wouldn't worry too much about the memory overhead for
>> >>     the stack.
>> >>
>> >>     cheers,
>> >>     Per
>> >>
> 
> 
> 
> -- 
> Ruslan
> CEO @ Jelastic <https://jelastic.com/>