High sys time in G1 GC logs and questions about big heap tuning

Hector Caballero hector.caballero at ericsson.com
Fri Oct 20 15:43:34 UTC 2017


Hi,

We've been troubleshooting a high-traffic application for some time now
and have found that GC times are impacting performance. First, some
context:

We are using G1 GC with JDK 8u131 on a RHEL 6 server with the latest
patches and these JVM parameters:

         javaOptions = -Xms128g
         javaOptions = -Xmx128g
         javaOptions = -XX:MetaspaceSize=128m
         javaOptions = -XX:+UseG1GC
         javaOptions = -XX:InitiatingHeapOccupancyPercent=15
         javaOptions = -XX:MaxGCPauseMillis=200
         javaOptions = -XX:+ParallelRefProcEnabled
         javaOptions = -XX:SoftRefLRUPolicyMSPerMB=0
         javaOptions = -XX:ParallelGCThreads=24
         javaOptions = -XX:+PrintGCDetails
         javaOptions = -XX:+PrintGCDateStamps
         javaOptions = -XX:+PrintTenuringDistribution
         javaOptions = -XX:+PrintAdaptiveSizePolicy
         javaOptions = -XX:+PrintReferenceGC
         javaOptions = -XX:+UnlockDiagnosticVMOptions
         javaOptions = -XX:+G1SummarizeRSetStats
         javaOptions = -Xloggc:/opt/gerrit/review_site/logs/jvm/jvm_gc-%t.log
         javaOptions = -XX:+UseGCLogFileRotation
         javaOptions = -XX:GCLogFileSize=40M
         javaOptions = -XX:NumberOfGCLogFiles=20
         javaOptions = -Djava.util.Arrays.useLegacyMergeSort=true

and the system was stable. One clarification: the IHOP value was lowered
to avoid the full GCs that were killing us, and it worked well for that
purpose. The only downside we saw with these parameters was a slight hit
to throughput but, as said, the system was still very usable.

After upgrading the application to the latest version, the memory
pattern changed dramatically and full GCs came back, making the
application unusable. The main change we saw was a very fast growth of
the old generation (up to 115 GB) while the young generation stayed
small. As an urgent measure, the system memory was upgraded and the heap
size was raised to 256 GB, which brought a dramatic improvement in
application performance, but we're still seeing a big regression in some
operations, which now take 3-4 times longer. We understand we need to
look at the code itself (the application is open source) to see how
object creation has changed, but we have a lot of pressure from users
and management to solve this right now, so code changes will have to
wait a bit. Our goal for now is to make the application as usable as
possible by tuning memory usage, to give us some room to start looking
at the code.
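
In terms of the option list above, that emergency change comes down to
the heap size flags, roughly like this (shown with -Xms kept equal to
-Xmx, as before):

         javaOptions = -Xms256g
         javaOptions = -Xmx256g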

The increase in heap size led to very low throughput, around 70%. To try
to improve it, we first relaxed the pause goal to 300 ms; this did not
seem to hurt the application much (at least it was not worse) and things
improved a bit, but not enough. Next we decided to start the concurrent
cycle a bit later to reduce the number of mixed collections, so we
raised the IHOP to 18% (a very modest change).
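
Expressed in the same option format as above, those two adjustments are
simply:

         javaOptions = -XX:MaxGCPauseMillis=300
         javaOptions = -XX:InitiatingHeapOccupancyPercent=18
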
After that, looking at the GC logs we saw that, even without full GCs,
there are young collections taking a long time (up to a minute in one
case). In several of them the sys time is much higher than the user
time, which seems strange; we haven't found much information about the
causes of this, but we can see it is hurting us a lot:

Timestamp    User Time (secs)    Sys Time (secs)    Real Time (secs)
2017-10-20T14:19:41    1.68    9.5    0.48
2017-10-20T14:19:59    2.26    9.56    0.56
2017-10-20T14:34:45    12.26    14.01    1.23
2017-10-20T15:26:33    9.82    31.73    2.39
2017-10-20T15:26:37    15.68    46.47    3.19
2017-10-20T15:26:41    10.11    39.64    2.44
2017-10-20T15:28:18    15.53    45.14    3.08
2017-10-20T15:29:01    5.4    41.63    2.43
2017-10-20T15:29:05    74.69    93.91    11.51
2017-10-20T15:29:18    80.67    137.35    18.38
2017-10-20T15:30:55    19.65    23.99    2.16
2017-10-20T15:31:03    11.03    29.28    1.82
2017-10-20T15:31:28    4.88    8.44    0.77
2017-10-20T15:31:30    1.58    59.97    3.21
2017-10-20T15:31:35    57.13    135.82    14.96
2017-10-20T15:32:23    25.46    150.54    14.79
2017-10-20T15:33:11    87.92    112.98    16.46
2017-10-20T15:35:13    17.99    32.15    2.33
2017-10-20T15:35:54    154.7    191.82    29.67

Any hints here?

The second question is about general tuning for this big heap, since it
seems we're not doing well so far. We've read a lot of documentation
about tuning this collector and thought we had a fairly good
understanding of the process, but that has proven wrong: some of our
tunings (such as raising the IHOP further or increasing the number of
parallel GC threads) have shown no improvement, and in some cases
regressions, such as longer young collections or a drop in throughput.
I'm attaching a chunk of the log so you can take a quick look if
possible and hopefully point us to some ideas.
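
Just to be concrete about the experiments mentioned above, the changes
we tried were of this shape (the numbers here are only illustrative, not
the exact values from every test):

         javaOptions = -XX:InitiatingHeapOccupancyPercent=35
         javaOptions = -XX:ParallelGCThreads=32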

Thanks a lot,

Hector Caballero




-------------- next part --------------
A non-text attachment was scrubbed...
Name: jvm_gc-2017-10-20_14-19-14.log.0.current.gz
Type: application/gzip
Size: 1159006 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20171020/350a87dd/jvm_gc-2017-10-20_14-19-14.log.0.current.gz>

