occasional ParNew times of 15+ seconds

David Ely david.ely at unboundid.com
Mon Oct 31 00:09:38 UTC 2016


Thanks again Vitaly. Responses inline.

On Sun, Oct 30, 2016 at 1:56 PM, Vitaly Davidovich <vitalyd at gmail.com>
wrote:

>
>
> On Sunday, October 30, 2016, David Ely <david.ely at unboundid.com> wrote:
>
>> Thank you Vitaly and Charlie. We will have them disable THP, move to a
>> later version of the JVM, and add in some additional GC logging JVM options.
>>
>> Looking more at the GC log, it appears that the long ParNew pauses only
>> occur once the old generation usage is at least halfway between the live
>> size and the occupancy at which CMS is triggered via
>> CMSInitiatingOccupancyFraction. After a CMS collection, the long pauses
>> stop. However, there are plenty of CMS cycles where we don't see any long
>> pauses, and plenty of collections that promote as much data as a long
>> pause did without experiencing one.
>>
>> Is this behavior consistent with the THP diagnosis?
>>
> The very high sys time is unusual for a parnew collection.  THP defrag is
> one possible known cause of that.  It's certainly possible there's
> something else going on, but turning off THP is a good start in
> troubleshooting; even if it's not the cause here, it may bite your customer
> later.
>

The sys times are high, but they are not especially high relative to the
user times: the sys-to-user ratio is roughly the same across all of the
ParNew collections, long and short.
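
For example, this is a quick sanity check of those ratios (a sketch against
the "[Times: user=... sys=..., real=...]" entries shown in the log excerpts
below; the awk field positions are an assumption based on that layout):

$ egrep -o 'user=[0-9.]+ sys=[0-9.]+, real=[0-9.]+' gc.log.0 | \
    awk -F'[=, ]+' '$6 > 0 { printf "%.1f %.2f\n", $2/$6, $4/$6 }'

That prints user/real and sys/real for each collection; in the excerpts
below, both the 50ms and the 15+ second ParNews come out at around 30x
user and under 1x sys.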


>
> A few more questions in the meantime:
>
> 1) are these parnew tails reproducible?
>

I believe so. They are seeing it on multiple systems. It seems to have
gotten worse on the newer systems, which have 256GB of RAM compared to 96GB.


> 2) is this running on bare metal or VM?
>

Bare metal.


> 3) what's the hardware spec?
>

These specific pauses are on hardware that they acquired recently. Java
sees 48 CPUs, and the machine has 256GB of RAM.


>
> If you can have the customer disable THP without bumping the JVM version,
> it would help pinpoint the issue.  But, I understand if you just want to
> fix the issue asap.
>

Since they are seeing this on multiple systems, they should be able to set
aside at least one system where the only change is disabling THP.

They'll have to put these changes through their testing environment, so it
might be a little while before I have an update.


>
>
>>
>> On Sat, Oct 29, 2016 at 6:15 PM, charlie hunt <charlie.hunt at oracle.com>
>> wrote:
>>
>>> +1 on disabling THP
>>>
>>> Charlie
>>>
>>> On Oct 29, 2016, at 10:07 AM, Vitaly Davidovich <vitalyd at gmail.com>
>>> wrote:
>>>
>>> David,
>>>
>>> Ask them to turn off THP - it's a known source of large latency due to
>>> the kernel doing page defragmentation; your app takes a page fault, and
>>> boom - the kernel may start doing defragmentation to make a huge page
>>> available.  You can search online for THP issues.  The symptoms are similar
>>> to yours - very high sys time.
>>>
>>> If they turn it off and still get same lengthy parnew pauses, then it's
>>> clearly something else but at least we'll eliminate THP as the culprit.
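>>>
>>> For reference, checking and flipping it looks something like this (a
>>> sketch; the exact sysfs path varies by distro - e.g. RHEL 6 uses
>>> /sys/kernel/mm/redhat_transparent_hugepage instead):
>>>
>>> $ cat /sys/kernel/mm/transparent_hugepage/enabled
>>> [always] madvise never
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> # echo never > /sys/kernel/mm/transparent_hugepage/defrag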
>>>
>>> On Saturday, October 29, 2016, David Ely <david.ely at unboundid.com>
>>> wrote:
>>>
>>>> Thank you for the response. Yes: meminfo (full output below) shows ~80GB
>>>> of AnonHugePages, which is pretty close to the size of the JVM heap.
>>>> Looking back through the information that we already have from this
>>>> customer, transparent huge pages have been turned on for years. We've
>>>> asked them whether anything else might have changed in this environment.
>>>>
>>>> Are there any other JVM options that we could enable that would shed
>>>> light on what's going on within the ParNew? Would -XX:+PrintTLAB
>>>> -XX:+PrintPLAB -XX:PrintFLSStatistics=1 show anything useful?
>>>>
>>>> David
>>>>
>>>>
>>>> MemTotal:       264396572 kB
>>>> MemFree:         2401576 kB
>>>> Buffers:          381564 kB
>>>> Cached:         172673120 kB
>>>> SwapCached:            0 kB
>>>> Active:         163439836 kB
>>>> Inactive:       90737452 kB
>>>> Active(anon):   76910848 kB
>>>> Inactive(anon):  4212580 kB
>>>> Active(file):   86528988 kB
>>>> Inactive(file): 86524872 kB
>>>> Unevictable:           0 kB
>>>> Mlocked:               0 kB
>>>> SwapTotal:      16236540 kB
>>>> SwapFree:       16236540 kB
>>>> Dirty:             14552 kB
>>>> Writeback:             0 kB
>>>> AnonPages:      81111768 kB
>>>> Mapped:            31312 kB
>>>> Shmem:               212 kB
>>>> Slab:            6078732 kB
>>>> SReclaimable:    5956052 kB
>>>> SUnreclaim:       122680 kB
>>>> KernelStack:       41296 kB
>>>> PageTables:       171324 kB
>>>> NFS_Unstable:          0 kB
>>>> Bounce:                0 kB
>>>> WritebackTmp:          0 kB
>>>> CommitLimit:    148434824 kB
>>>> Committed_AS:   93124984 kB
>>>> VmallocTotal:   34359738367 kB
>>>> VmallocUsed:      686780 kB
>>>> VmallocChunk:   34225639420 kB
>>>> HardwareCorrupted:     0 kB
>>>> *AnonHugePages:  80519168 kB*
>>>> HugePages_Total:       0
>>>> HugePages_Free:        0
>>>> HugePages_Rsvd:        0
>>>> HugePages_Surp:        0
>>>> Hugepagesize:       2048 kB
>>>> DirectMap4k:        5132 kB
>>>> DirectMap2M:     1957888 kB
>>>> DirectMap1G:    266338304 kB
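>>>>
>>>> If it helps to confirm that the huge pages belong to the JVM
>>>> specifically, the per-mapping AnonHugePages figures in smaps can be
>>>> summed as well (a sketch; it assumes a kernel recent enough to report
>>>> AnonHugePages in smaps, and <jvm-pid> is a placeholder):
>>>>
>>>> $ grep AnonHugePages /proc/<jvm-pid>/smaps | awk '{ s += $2 } END { print s, "kB" }'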
>>>>
>>>>
>>>> On Fri, Oct 28, 2016 at 8:04 PM, Vitaly Davidovich <vitalyd at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Friday, October 28, 2016, David Ely <david.ely at unboundid.com>
>>>>> wrote:
>>>>>
>>>>>> While typical ParNew GC times are 50ms, our application is
>>>>>> occasionally hitting ParNew times that are over 15 seconds for one of our
>>>>>> customers, and we have no idea why. Looking at the full GC log file:
>>>>>>
>>>>>> 382250 ParNew GCs are < 100ms
>>>>>> 9303 are 100ms to 1 second
>>>>>> 1267 are 1 second to 2 seconds
>>>>>> 99 are 2 seconds to 10 seconds
>>>>>> 24 are > 10 seconds, 48 seconds being the max
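>>>>>>
>>>>>> For what it's worth, the buckets came from something like this (a
>>>>>> sketch; it assumes each ParNew line carries exactly one "real="
>>>>>> figure, as in the excerpts below):
>>>>>>
>>>>>> $ egrep 'ParNew' gc.log.0 | egrep -o 'real=[0-9.]+' | cut -d= -f2 |
>>>>>>     awk '{ if ($1 < 0.1) a++; else if ($1 < 1) b++;
>>>>>>            else if ($1 < 2) c++; else if ($1 < 10) d++; else e++ }
>>>>>>          END { print a+0, b+0, c+0, d+0, e+0 }'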
>>>>>>
>>>>>> The long ones are somewhat bursty as you can see from looking at the
>>>>>> line numbers in the GC log:
>>>>>>
>>>>>> $ egrep -n 'ParNew.*real=[1-9][0-9]' gc.log.0
>>>>>>
>>>>>> 12300:2016-10-21T01:03:20.380+0000: 20278.069:
>>>>>> [GC2016-10-21T01:03:20.380+0000: 20278.070: [ParNew:
>>>>>> 1697741K->10024K(1887488K), 16.9913450 secs] 33979542K->32817239K(84724992K),
>>>>>> 16.9921050 secs] [Times: user=541.32 sys=14.37, real=16.99 secs]
>>>>>> 43730:2016-10-21T14:12:25.050+0000: 67622.740:
>>>>>> [GC2016-10-21T14:12:25.051+0000: 67622.740: [ParNew:
>>>>>> 1728194K->33817K(1887488K), 12.7508470 secs] 49737924K->48320707K(84724992K),
>>>>>> 12.7517840 secs] [Times: user=405.89 sys=11.05, real=12.75 secs]
>>>>>> 44079:2016-10-21T14:18:55.172+0000: 68012.862:
>>>>>> [GC2016-10-21T14:18:55.173+0000: 68012.862: [ParNew:
>>>>>> 1698371K->26958K(1887488K), 12.7384460 secs] 50339815K->48930730K(84724992K),
>>>>>> 12.7392360 secs] [Times: user=406.58 sys=11.29, real=12.73 secs]
>>>>>> 50151:2016-10-21T17:10:14.471+0000: 78292.160:
>>>>>> [GC2016-10-21T17:10:14.471+0000: 78292.161: [ParNew:
>>>>>> 1713813K->40968K(1887488K), 18.6593320 secs] 49366906K->47959129K(84724992K),
>>>>>> 18.6602550 secs] [Times: user=590.03 sys=17.45, real=18.66 secs]
>>>>>> 56073:2016-10-21T19:59:36.847+0000: 88454.536:
>>>>>> [GC2016-10-21T19:59:36.847+0000: 88454.537: [ParNew:
>>>>>> 1685720K->20763K(1887488K), 16.0840200 secs] 50704025K->49302131K(84724992K),
>>>>>> 16.0848810 secs] [Times: user=487.00 sys=16.84, real=16.09 secs]
>>>>>> 78987:2016-10-22T05:49:25.623+0000: 123843.312:
>>>>>> [GC2016-10-22T05:49:25.623+0000: 123843.313: [ParNew:
>>>>>> 1709771K->22678K(1887488K), 10.9933380 secs] 43323834K->41914203K(84724992K),
>>>>>> 10.9943060 secs] [Times: user=349.67 sys=9.84, real=10.99 secs]
>>>>>> 79104:2016-10-22T05:59:26.382+0000: 124444.071:
>>>>>> [GC2016-10-22T05:59:26.382+0000: 124444.072: [ParNew:
>>>>>> 1697024K->22260K(1887488K), 11.5490390 secs] 44558499K->43145880K(84724992K),
>>>>>> 11.5499650 secs] [Times: user=367.73 sys=10.01, real=11.55 secs]
>>>>>> 79504:2016-10-22T06:09:36.983+0000: 125054.672:
>>>>>> [GC2016-10-22T06:09:36.984+0000: 125054.673: [ParNew:
>>>>>> 1688112K->4769K(1887488K), 14.1528810 secs] 46684947K->45263748K(84724992K),
>>>>>> 14.1539860 secs] [Times: user=452.28 sys=12.71, real=14.15 secs]
>>>>>> 79772:2016-10-22T06:30:36.130+0000: 126313.819:
>>>>>> [GC2016-10-22T06:30:36.130+0000: 126313.820: [ParNew:
>>>>>> 1725520K->35893K(1887488K), 14.4479670 secs] 48989739K->47563879K(84724992K),
>>>>>> 14.4488810 secs] [Times: user=461.60 sys=13.04, real=14.45 secs]
>>>>>> 80087:2016-10-22T06:37:07.202+0000: 126704.891:
>>>>>> [GC2016-10-22T06:37:07.202+0000: 126704.892: [ParNew:
>>>>>> 1698021K->23440K(1887488K), 15.7039920 secs] 50517163K->49105987K(84724992K),
>>>>>> 15.7050040 secs] [Times: user=497.65 sys=14.75, real=15.70 secs]
>>>>>> 89969:2016-10-22T13:54:27.978+0000: 152945.667:
>>>>>> [GC2016-10-22T13:54:27.978+0000: 152945.668: [ParNew:
>>>>>> 1834914K->15978K(1887488K), 11.5637150 secs] 48716340K->47307673K(84724992K),
>>>>>> 11.5645440 secs] [Times: user=367.77 sys=10.01, real=11.57 secs]
>>>>>> 90200:2016-10-22T14:05:02.717+0000: 153580.407:
>>>>>> [GC2016-10-22T14:05:02.718+0000: 153580.407: [ParNew:
>>>>>> 1684626K->7078K(1887488K), 17.3424650 secs] 50361539K->48947648K(84724992K),
>>>>>> 17.3433490 secs] [Times: user=554.39 sys=15.81, real=17.34 secs]
>>>>>> 90299:2016-10-22T14:14:30.521+0000: 154148.210:
>>>>>> [GC2016-10-22T14:14:30.521+0000: 154148.211: [ParNew:
>>>>>> 1690850K->6078K(1887488K), 13.1699350 secs] 51455784K->50033156K(84724992K),
>>>>>> 13.1708900 secs] [Times: user=419.55 sys=11.54, real=13.17 secs]
>>>>>> 261329:2016-10-26T00:06:44.499+0000: 448882.189:
>>>>>> [GC2016-10-26T00:06:44.500+0000: 448882.189: [ParNew:
>>>>>> 1705614K->22224K(1887488K), 17.5831730 secs] 40683698K->39525817K(84724992K),
>>>>>> 17.5843270 secs] [Times: user=561.85 sys=14.79, real=17.58 secs]
>>>>>> 261935:2016-10-26T00:13:34.277+0000: 449291.967:
>>>>>> [GC2016-10-26T00:13:34.278+0000: 449291.967: [ParNew:
>>>>>> 1690085K->26707K(1887488K), 13.9331790 secs] 43792178K->42655000K(84724992K),
>>>>>> 13.9340780 secs] [Times: user=446.36 sys=11.45, real=13.93 secs]
>>>>>> 262143:2016-10-26T00:20:09.397+0000: 449687.087:
>>>>>> [GC2016-10-26T00:20:09.398+0000: 449687.087: [ParNew:
>>>>>> 1696593K->27078K(1887488K), 40.3344500 secs] 45588644K->44444949K(84724992K),
>>>>>> 40.3355430 secs] [Times: user=1248.15 sys=43.07, real=40.33 secs]
>>>>>> 262275:2016-10-26T00:27:02.196+0000: 450099.886:
>>>>>> [GC2016-10-26T00:27:02.197+0000: 450099.886: [ParNew:
>>>>>> 1683406K->17853K(1887488K), 17.7472360 secs] 46908499K->45506131K(84724992K),
>>>>>> 17.7482260 secs] [Times: user=567.03 sys=16.10, real=17.75 secs]
>>>>>> 262282:2016-10-26T00:27:29.448+0000: 450127.138:
>>>>>> [GC2016-10-26T00:27:29.449+0000: 450127.138: [ParNew:
>>>>>> 1687737K->10499K(1887488K), 35.4934000 secs] 47195678K->46044477K(84724992K),
>>>>>> 35.4943230 secs] [Times: user=1131.34 sys=31.87, real=35.49 secs]
>>>>>> 262631:2016-10-26T00:34:17.632+0000: 450535.321:
>>>>>> [GC2016-10-26T00:34:17.632+0000: 450535.321: [ParNew:
>>>>>> 1687590K->10226K(1887488K), 21.4043600 secs] 49431427K->48018504K(84724992K),
>>>>>> 21.4052230 secs] [Times: user=682.50 sys=19.46, real=21.41 secs]
>>>>>> 262844:2016-10-26T00:41:08.118+0000: 450945.808:
>>>>>> [GC2016-10-26T00:41:08.119+0000: 450945.808: [ParNew:
>>>>>> 1692928K->11302K(1887488K), 48.2899260 secs] 51073216K->49915878K(84724992K),
>>>>>> 48.2909550 secs] [Times: user=1493.17 sys=53.55, real=48.28 secs]
>>>>>> 345421:2016-10-27T04:17:59.617+0000: 550357.306:
>>>>>> [GC2016-10-27T04:17:59.618+0000: 550357.307: [ParNew:
>>>>>> 1695052K->22991K(1887488K), 33.8707510 secs] 46334738K->45187822K(84724992K),
>>>>>> 33.8718980 secs] [Times: user=1081.31 sys=30.59, real=33.86 secs]
>>>>>> 345510:2016-10-27T04:24:11.721+0000: 550729.411:
>>>>>> [GC2016-10-27T04:24:11.722+0000: 550729.411: [ParNew:
>>>>>> 1705080K->20401K(1887488K), 18.9795540 secs] 47388073K->45965537K(84724992K),
>>>>>> 18.9805410 secs] [Times: user=606.94 sys=17.25, real=18.98 secs]
>>>>>> 345514:2016-10-27T04:24:36.695+0000: 550754.385:
>>>>>> [GC2016-10-27T04:24:36.696+0000: 550754.385: [ParNew:
>>>>>> 1707810K->32640K(1887488K), 30.9728200 secs] 47656489K->46506725K(84724992K),
>>>>>> 30.9737300 secs] [Times: user=917.67 sys=33.07, real=30.97 secs]
>>>>>> 345777:2016-10-27T04:31:30.102+0000: 551167.791:
>>>>>> [GC2016-10-27T04:31:30.102+0000: 551167.791: [ParNew:
>>>>>> 1704518K->30860K(1887488K), 38.0976720 secs] 49570144K->48422333K(84724992K),
>>>>>> 38.0984950 secs] [Times: user=1215.89 sys=34.79, real=38.09 secs]
>>>>>>
>>>>>> Context around a single instance is fairly normal:
>>>>>>
>>>>>> 345773-2016-10-27T04:31:28.032+0000: 551165.721:
>>>>>> [GC2016-10-27T04:31:28.033+0000: 551165.722: [ParNew:
>>>>>> 1685858K->8851K(1887488K), 0.0480250 secs] 49545909K->47870050K(84724992K),
>>>>>> 0.0490200 secs] [Times: user=1.47 sys=0.02, real=0.05 secs]
>>>>>> 345774-2016-10-27T04:31:28.635+0000: 551166.324:
>>>>>> [GC2016-10-27T04:31:28.636+0000: 551166.325: [ParNew:
>>>>>> 1686675K->10456K(1887488K), 0.0463570 secs] 49547874K->47872545K(84724992K),
>>>>>> 0.0473410 secs] [Times: user=1.41 sys=0.04, real=0.05 secs]
>>>>>> 345775-2016-10-27T04:31:29.205+0000: 551166.894:
>>>>>> [GC2016-10-27T04:31:29.205+0000: 551166.894: [ParNew:
>>>>>> 1688280K->12733K(1887488K), 0.0487100 secs] 49550369K->47876404K(84724992K),
>>>>>> 0.0496310 secs] [Times: user=1.47 sys=0.04, real=0.05 secs]
>>>>>> 345776-2016-10-27T04:31:29.798+0000: 551167.487:
>>>>>> [GC2016-10-27T04:31:29.798+0000: 551167.488: [ParNew:
>>>>>> 1690557K->26694K(1887488K), 0.0471170 secs] 49554228K->47892320K(84724992K),
>>>>>> 0.0481180 secs] [Times: user=1.40 sys=0.02, real=0.05 secs]
>>>>>> 345777:2016-10-27T04:31:30.102+0000: 551167.791:
>>>>>> [GC2016-10-27T04:31:30.102+0000: 551167.791: [ParNew:
>>>>>> 1704518K->30860K(1887488K), 38.0976720 secs] 49570144K->48422333K(84724992K),
>>>>>> 38.0984950 secs] [Times: user=1215.89 sys=34.79, real=38.09 secs]
>>>>>> 345778-2016-10-27T04:32:08.449+0000: 551206.139:
>>>>>> [GC2016-10-27T04:32:08.450+0000: 551206.139: [ParNew:
>>>>>> 1708684K->122033K(1887488K), 0.0664280 secs] 50100157K->48528020K(84724992K),
>>>>>> 0.0672860 secs] [Times: user=1.60 sys=0.05, real=0.07 secs]
>>>>>> 345779-2016-10-27T04:32:09.090+0000: 551206.779:
>>>>>> [GC2016-10-27T04:32:09.091+0000: 551206.780: [ParNew:
>>>>>> 1799857K->42169K(1887488K), 0.0688910 secs] 50205844K->48541030K(84724992K),
>>>>>> 0.0696110 secs] [Times: user=1.70 sys=0.03, real=0.07 secs]
>>>>>> 345780-2016-10-27T04:32:09.802+0000: 551207.491:
>>>>>> [GC2016-10-27T04:32:09.802+0000: 551207.491: [ParNew:
>>>>>> 1719993K->43790K(1887488K), 0.0508540 secs] 50218854K->48542651K(84724992K),
>>>>>> 0.0516000 secs] [Times: user=1.54 sys=0.03, real=0.05 secs]
>>>>>> 345781-2016-10-27T04:32:10.536+0000: 551208.226:
>>>>>> [GC2016-10-27T04:32:10.537+0000: 551208.226: [ParNew:
>>>>>> 1721614K->30389K(1887488K), 0.0668100 secs] 50220475K->48545932K(84724992K),
>>>>>> 0.0675470 secs] [Times: user=1.81 sys=0.03, real=0.06 secs]
>>>>>> 345782-2016-10-27T04:32:11.137+0000: 551208.826:
>>>>>> [GC2016-10-27T04:32:11.137+0000: 551208.826: [ParNew:
>>>>>> 1708213K->18631K(1887488K), 0.0632570 secs] 50223756K->48540797K(84724992K),
>>>>>> 0.0639650 secs] [Times: user=1.95 sys=0.01, real=0.06 secs]
>>>>>> 345783-2016-10-27T04:32:11.642+0000: 551209.332:
>>>>>> [GC2016-10-27T04:32:11.643+0000: 551209.332: [ParNew:
>>>>>> 1696455K->19415K(1887488K), 0.0509260 secs] 50218621K->48545033K(84724992K),
>>>>>> 0.0516780 secs] [Times: user=1.55 sys=0.03, real=0.05 secs]
>>>>>>
>>>>>> Since the user times are high as well, I don't think this could be
>>>>>> swapping.
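>>>>>>
>>>>>> For a direct check, watching vmstat during one of the long pauses
>>>>>> should settle it; if nothing is being swapped, the si/so columns
>>>>>> stay at 0:
>>>>>>
>>>>>> $ vmstat 1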
>>>>>>
>>>>> Can you ask the customer if they're using transparent hugepages (THP)?
>>>>>
>>>>>>
>>>>>> Here are the hard-earned set of JVM arguments that we're using:
>>>>>>
>>>>>> -d64 -server -Xmx81g -Xms81g -XX:NewSize=2g -XX:MaxNewSize=2g \
>>>>>>   -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled \
>>>>>>   -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled \
>>>>>>   -XX:+CMSScavengeBeforeRemark -XX:RefDiscoveryPolicy=1 \
>>>>>>   -XX:ParallelCMSThreads=12 -XX:CMSMaxAbortablePrecleanTime=3600000 \
>>>>>>   -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly \
>>>>>>   -XX:+UseParNewGC -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=2 \
>>>>>>   -XX:+UseCompressedOops -XX:PermSize=256M -XX:MaxPermSize=256M \
>>>>>>   -XX:+HeapDumpOnOutOfMemoryError -XX:+UseMembar -XX:+UseLargePages \
>>>>>>   -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintCommandLineFlags \
>>>>>>   -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=3 -XX:GCLogFileSize=100m \
>>>>>>   -Xloggc:${INSTANCE_ROOT}/logs/jvm/gc.log
>>>>>>
>>>>>> This is on Linux with Java 1.7.0_72.
>>>>>>
>>>>>> Does this look familiar to anyone? Alternatively, are there some more
>>>>>> JVM options that we could include to get more information?
>>>>>>
>>>>>> One of the first things that we'll try is to move to a later JVM, but
>>>>>> it will be easier to get the customer to do that if we can point to a
>>>>>> specific issue that has been addressed.
>>>>>>
>>>>>> Thanks for your help.
>>>>>>
>>>>>> David
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from my phone
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Sent from my phone
>>>
>>>
>>
>
> --
> Sent from my phone
>