High sys time in G1 GC logs and questions about big heap tuning

Vitaly Davidovich vitalyd at gmail.com
Fri Oct 20 18:20:08 UTC 2017


On Fri, Oct 20, 2017 at 11:43 AM, Hector Caballero <
hector.caballero at ericsson.com> wrote:

> Hi,
>
> We've been troubleshooting a big traffic application for some time now and
> have discovered GC times are impacting performance. Next paragraph just to
> give some context:
>
> We are using G1 GC with JDK 8u131 on a RHEL 6 server with latest patches
> and these JVM parameters:
>
>         javaOptions = -Xms128g
>         javaOptions = -Xmx128g
>         javaOptions = -XX:MetaspaceSize=128m
>         javaOptions = -XX:+UseG1GC
>         javaOptions = -XX:InitiatingHeapOccupancyPercent=15
>         javaOptions = -XX:MaxGCPauseMillis=200
>         javaOptions = -XX:+ParallelRefProcEnabled
>         javaOptions = -XX:SoftRefLRUPolicyMSPerMB=0
>         javaOptions = -XX:ParallelGCThreads=24
>         javaOptions = -XX:+PrintGCDetails
>         javaOptions = -XX:+PrintGCDateStamps
>         javaOptions = -XX:+PrintTenuringDistribution
>         javaOptions = -XX:+PrintAdaptiveSizePolicy
>         javaOptions = -XX:+PrintReferenceGC
>         javaOptions = -XX:+UnlockDiagnosticVMOptions
>         javaOptions = -XX:+G1SummarizeRSetStats
>         javaOptions = -Xloggc:/opt/gerrit/review_site/logs/jvm/jvm_gc-%t.log
>         javaOptions = -XX:+UseGCLogFileRotation
>         javaOptions = -XX:GCLogFileSize=40M
>         javaOptions = -XX:NumberOfGCLogFiles=20
>         javaOptions = -Djava.util.Arrays.useLegacyMergeSort=true
>
> and the system was stable. One clarification: the IHOP value was lowered
> to avoid the full GCs that were killing us, and it worked well for that
> purpose. The only downside we saw with these parameters was a slight
> impact on throughput but, as said, the system was still very usable.
>
> After upgrading the application to the latest version, the memory pattern
> changed dramatically and full GCs came back, making the application
> unusable. The main change we saw was very fast growth of the old
> generation (up to 115 GB) while the young generation stayed small. As an
> urgent measure, the system memory was upgraded and the heap size raised
> to 256 GB, which brought a dramatic improvement in application
> performance, but we're still seeing a big regression in some operations
> that now take 3-4 times longer. We understand we need to look at the code
> itself (this application is an open source one) to see how object
> creation has changed, but we are under a lot of pressure from users and
> management to solve this right now, so code changes need to wait a bit.
> Our goal right now is to make the application as usable as possible by
> tuning memory usage, to give us some room to start looking at the code.
>
> The increase in heap size led to very low throughput, around 70%. To try
> to improve it, we relaxed the pause goal to 300 ms. Doing this didn't
> affect the application much (at least it was not worse) and things
> improved a bit, but not enough. Next we decided to start collecting a bit
> later, to reduce the number of mixed collections, so we increased the
> IHOP to 18% (a very modest change). After that, looking at the GC logs,
> we saw that even without full GCs there are young collections taking a
> long time (up to a minute in one case). In some of those cases the sys
> time was much higher than the user time, which is strange; we haven't
> found much information about the causes of this, but we can see it is
> affecting us a lot:
>
> Timestamp    User Time (secs)    Sys Time (secs)    Real Time (secs)
> 2017-10-20T14:19:41    1.68    9.5    0.48
> 2017-10-20T14:19:59    2.26    9.56    0.56
> 2017-10-20T14:34:45    12.26    14.01    1.23
> 2017-10-20T15:26:33    9.82    31.73    2.39
> 2017-10-20T15:26:37    15.68    46.47    3.19
> 2017-10-20T15:26:41    10.11    39.64    2.44
> 2017-10-20T15:28:18    15.53    45.14    3.08
> 2017-10-20T15:29:01    5.4    41.63    2.43
> 2017-10-20T15:29:05    74.69    93.91    11.51
> 2017-10-20T15:29:18    80.67    137.35    18.38
> 2017-10-20T15:30:55    19.65    23.99    2.16
> 2017-10-20T15:31:03    11.03    29.28    1.82
> 2017-10-20T15:31:28    4.88    8.44    0.77
> 2017-10-20T15:31:30    1.58    59.97    3.21
> 2017-10-20T15:31:35    57.13    135.82    14.96
> 2017-10-20T15:32:23    25.46    150.54    14.79
> 2017-10-20T15:33:11    87.92    112.98    16.46
> 2017-10-20T15:35:13    17.99    32.15    2.33
> 2017-10-20T15:35:54    154.7    191.82    29.67
>
> Any hints here?
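A quick way to spot these events is to filter the log for collections where
sys exceeds user. This sketch assumes the standard
"[Times: user=... sys=..., real=... secs]" lines that -XX:+PrintGCDetails
produces; adjust the field parsing if your log format differs:

```shell
# Print GC events whose sys time exceeds user time.
# Fields look like "user=1.68", "sys=9.50," (note trailing comma);
# the "+ 0" coerces the substring to a number, dropping the comma.
awk '/\[Times:/ {
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^user=/) { u = substr($i, 6) + 0 }
    if ($i ~ /^sys=/)  { s = substr($i, 5) + 0 }
  }
  if (s > u) print
}' jvm_gc.log
```

Running this over the log should reproduce the table above and make it easy
to correlate the spikes with wall-clock time.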

Do you have transparent huge pages (THP) enabled? This is a fairly common
cause for outsized sys times.
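To check, look under /sys. Note that the exact path varies: mainline
kernels use transparent_hugepage, while some RHEL 6 kernels use the Red Hat
backport path redhat_transparent_hugepage; adjust accordingly:

```shell
# Check THP status (the bracketed value is the active setting).
# On RHEL 6 the path may be /sys/kernel/mm/redhat_transparent_hugepage.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Disable at runtime (needs root); make it persistent via rc.local
# or a tuned profile so it survives reboots:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```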

>


> The second question is about general tuning for this big heap, as it
> seems we're not doing well so far. We've read a lot of documentation
> about tuning this collector and thought we had a fairly acceptable
> understanding of the process, but this has proven wrong: some of our
> tunings (such as increasing the IHOP to bigger values and increasing the
> number of parallel threads) haven't shown any improvement and have even
> caused regressions in some cases, such as longer young collections or
> drops in throughput. I'm attaching a chunk of the log so you can have a
> quick look if possible and perhaps point us to some ideas.
>
> Thanks a lot,
>
> Hector Caballero
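On the broader large-heap question, a few flags are often worth
experimenting with on heaps this size. These are general suggestions, not
advice specific to this thread; measure each change against the GC logs
before keeping it:

```
# Candidate flags for very large G1 heaps (verify each one in your logs):
javaOptions = -XX:+AlwaysPreTouch       # touch all heap pages at startup so
                                        # first-use page faults don't show up
                                        # as sys time during collections
javaOptions = -XX:+UseLargePages        # back the heap with pre-reserved
                                        # huge pages (needs OS hugepage setup)
javaOptions = -XX:G1HeapRegionSize=32m  # fewer, larger regions for a
                                        # 256 GB heap (32 MB is the maximum)
```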