From wajih.ahmed at gmail.com  Wed Sep 12 13:26:09 2018
From: wajih.ahmed at gmail.com (Wajih Ahmed)
Date: Wed, 12 Sep 2018 09:26:09 -0400
Subject: G1GC fine tuning under heavy load
Message-ID:

Hello,

I have an application running on two nodes in a Kubernetes cluster. It is handling about 70 million requests per day. I have noticed a gradual decline in throughput, so much so that in about 7 days the throughput falls about 50%. A large percentage of this decline is in the first hour, followed by a gradual decline. This graph shows the pattern.

Some of the decline I can attribute to the application and use case itself. As the database starts growing rapidly, the system comes under memory and CPU pressure, and the database itself is also a Java application. So perhaps ignoring the decline of the first hour is prudent, but I am still interested in seeing whether I can tune the JVM of the app so that the throughput is more linear after the first hour.

I am also providing a gceasy.io report that will give the required information about GC activity. You will see I have done some rudimentary tuning already.

What I am curious about is whether the young gen size needs to be reduced by tuning G1NewSizePercent to reduce the duration of the pauses, in particular the object copy stage.

Secondly, what GCEasy is calling "consecutive full gc" don't appear to be full GCs. It might be CMS (initial-mark) activity, which accounts for most of the GC activity and has some long pause times. Would increasing InitiatingHeapOccupancyPercent be recommended to reduce this activity and give the application more time?

Any other advice will be helpful as I start to learn and unfold the mysteries of GC tuning :-)

Just in case you don't want to open the PDF report, these are my JVM args:

-XX:G1MixedGCCountTarget=12 -XX:InitialHeapSize=7516192768
-XX:MaxGCPauseMillis=200 -XX:MaxHeapSize=7516192768
-XX:MetaspaceSize=268435456 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintPromotionFailure -XX:+PrintTenuringDistribution
-XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC
-XX:-UseNUMA

Regards

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From david.weeda at det.nsw.edu.au  Thu Sep 13 00:12:51 2018
From: david.weeda at det.nsw.edu.au (David Weeda)
Date: Thu, 13 Sep 2018 00:12:51 +0000
Subject: G1GC fine tuning under heavy load
In-Reply-To:
References:
Message-ID:

Hello Ahmed,

What we have done is follow the documentation at
https://www.oracle.com/technetwork/articles/java/g1gc-1984535.html

We noted the following comment. To us it meant that ideally your heap size should be a power of 2 as well. We settled on 2 x 8GB.

-XX:G1HeapRegionSize=n
Sets the size of a G1 region. The value will be a power of two and can range from 1MB to 32MB. The goal is to have around 2048 regions based on the minimum Java heap size.

From our Java trace:

J [0.013s][info ][gc,heap] Heap region size: 4M
J [0.128s][info ][gc     ] Using G1

Regards,
David Weeda
SAP Technical Architect

From: hotspot-gc-use On Behalf Of Wajih Ahmed
Sent: Wednesday, 12 September 2018 11:26 PM
To: hotspot-gc-use at openjdk.java.net
Subject: G1GC fine tuning under heavy load

> [...]

**********************************************************************
This message is intended for the addressee named and may contain privileged information or confidential information or both. If you are not the intended recipient please delete it and notify the sender.
**********************************************************************

From wolfgang.pedot at finkzeit.at  Thu Sep 13 07:21:59 2018
From: wolfgang.pedot at finkzeit.at (Wolfgang Pedot)
Date: Thu, 13 Sep 2018 09:21:59 +0200
Subject: G1GC fine tuning under heavy load
In-Reply-To:
References:
Message-ID: <68b777a3-e416-3176-280c-2f8354f78d43@finkzeit.at>

Hello,

Am 12.09.2018 um 15:26 schrieb Wajih Ahmed:
> What I am curious about is whether the young gen size needs to be
> reduced by tuning G1NewSizePercent to reduce the duration of the
> pauses, in particular the object copy stage.

What I have learned tuning G1 is that it is usually best to trust G1's ergonomics and not directly influence the size allocated to the different generations. What you can do to test with a smaller NewGen is to reduce the pause-time target. From your report I would say that the 200ms you currently have configured is met pretty much all of the time, so the NewGen is sized correctly.

regards
Wolfgang

From stefan.johansson at oracle.com  Thu Sep 13 12:39:33 2018
From: stefan.johansson at oracle.com (Stefan Johansson)
Date: Thu, 13 Sep 2018 14:39:33 +0200
Subject: G1GC fine tuning under heavy load
In-Reply-To:
References:
Message-ID:

Hi,

Could you please provide the GC logs from the run as well? The reports give a good overview, but some details from the logs might help us give better advice. It will also help to confirm that there are no Full GCs occurring, as you say. Some more comments inline.

On 2018-09-12 15:26, Wajih Ahmed wrote:
> [...]
>
> What I am curious about is whether the young gen size needs to be
> reduced by tuning G1NewSizePercent to reduce the duration of the
> pauses, in particular the object copy stage.

This is a very hard question to answer. A smaller young gen of course means fewer regions to collect, but since the GCs will occur more frequently, fewer objects will have time to die, so it might be that a larger young gen is quicker to collect for some applications. And since long pause times don't seem to be the biggest problem, I wouldn't start the tuning here.

> Secondly, what GCEasy is calling "consecutive full gc" don't appear
> to be full GCs. It might be CMS (initial-mark) activity, which
> accounts for most of the GC activity and has some long pause times.
> Would increasing InitiatingHeapOccupancyPercent be recommended to
> reduce this activity and give the application more time?

Looking at the report, it looks like the old generation grows over time, and it might be that a lot of it is live, so the concurrent cycles don't free up that much and you are still above the limit afterwards. If this is the case, setting a higher InitiatingHeapOccupancyPercent could help.

Would also be helpful to know what version of Java you are running.
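For illustration, the change could look like this on the command line. The value 60 and the jar name are placeholders, not numbers derived from your report; validate any new setting against the GC logs:

```shell
# Sketch only: the posted heap settings with IHOP raised above the 45% default.
# 60 is an assumed starting point, app.jar a placeholder for the application.
java -XX:+UseG1GC \
     -Xms7516192768 -Xmx7516192768 \
     -XX:MaxGCPauseMillis=200 \
     -XX:InitiatingHeapOccupancyPercent=60 \
     -jar app.jar
```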
Cheers,
Stefan

> [...]

_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From wajih.ahmed at gmail.com  Thu Sep 13 21:34:20 2018
From: wajih.ahmed at gmail.com (Wajih Ahmed)
Date: Thu, 13 Sep 2018 17:34:20 -0400
Subject: G1GC fine tuning under heavy load
In-Reply-To:
References:
Message-ID:

The link to the file is
https://drive.google.com/open?id=1Y8ZvxxHy078xPjJFy26nMhxeTaWgIcwn
and I hope I have the correct file corresponding to the report, but even if it is not, it will exhibit the same pattern, as it could be from one of the servers in the cluster.

Regards,

On Thu, Sep 13, 2018 at 8:40 AM Stefan Johansson
<stefan.johansson at oracle.com> wrote:
> [...]

From stefan.johansson at oracle.com  Fri Sep 14 08:55:41 2018
From: stefan.johansson at oracle.com (Stefan Johansson)
Date: Fri, 14 Sep 2018 10:55:41 +0200
Subject: G1GC fine tuning under heavy load
In-Reply-To:
References:
Message-ID:

Thanks for the log,

After looking through the log I think the suggestion to increase InitiatingHeapOccupancyPercent (IHOP) stands. Your old generation is growing slowly, and after a while it gets above the concurrent cycle threshold, which is 45% of the heap (~3225M for a 7G heap). When the concurrent cycle finishes and the Mixed collections try to reclaim old generation space, only a few megabytes are reclaimed before the ergonomics decide not to do more Mixed collections. The reason for this is that most of the old generation data is still live, roughly 90%.
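As a quick sanity check, that threshold can be recomputed from the heap size posted earlier in the thread (shell arithmetic only; no assumptions beyond the posted -XX:MaxHeapSize):

```shell
# 45% (the default InitiatingHeapOccupancyPercent) of MaxHeapSize=7516192768,
# expressed in megabytes.
echo $(( 7516192768 * 45 / 100 / 1024 / 1024 ))
# prints 3225 -- the "~3225M for a 7G heap" above
```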
If you want the Mixed collections to reclaim some more space you can lower the G1HeapWastePercent value, which by default is 5. This will generate a few more Mixed collections that can reclaim some additional space, and it might be worth it since so much in old is live.

If you raise the IHOP a bit you should be able to avoid the back-to-back concurrent cycles and also be able to reclaim some more space when the Mixed collections are actually running.

One thing that you should investigate is why the old generation is growing over time. If this is expected, you will run into the same problem later on when it has grown to the new IHOP value.

One additional point: if you have the possibility, I would suggest trying to run with a later version of Java. There have been some great improvements to G1, and the adaptive IHOP feature could very well help you avoid doing this kind of tuning.

Hope this helps,
Stefan

On 2018-09-13 23:34, Wajih Ahmed wrote:
> [...]

_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use