Long safepoint pause directly after GC log file rotation in 1.7.0_80

Rainer Jung rainer.jung at kippdata.de
Tue Nov 10 15:21:19 UTC 2015


On 10.11.2015 at 15:53, Rainer Jung wrote:
> On 10.11.2015 at 14:31, Alex Bagehot wrote:
>> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6478739 relates
>> (and states it applies only to Solaris)
>
> Thanks for the explanation. I have now found why the problem happens
> on 80 but not on 76.
>
> Both versions contain the same very inefficient code. But in 76 there
> was a bug that closed the map file after reading the first entry,
> because the close() was done inside the read loop instead of after
> it. That bug was fixed in 80, so now the whole file is actually read,
> not only the first entry. Only then does the performance bug show
> itself.
>
> The responsible commit is
>
> http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/diff/e50eb3195734/src/os/solaris/vm/os_solaris.cpp
>
>
> But the real bug is reading the map file item by item instead of
> using pread() to read it all at once, as pmap does.
>
> I expect the problem is also present in 1.8.0 and 1.9.0 (to be checked).

Yes, the same problem is present in current 1.8.0 and 1.9.0. For 1.8.0 
it was introduced in 1.8.0_20.
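
For illustration, here is a minimal sketch of the two access patterns.
This is not the literal HotSpot or pmap source; the function names,
error handling and the fast variant are my own reconstruction of the
pattern (Solaris headers assumed):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <procfs.h>   /* Solaris: defines prmap_t */

/* Slow pattern, roughly the shape of check_addr0(): one read()
 * syscall per map entry. With ~1400 entries at ~40 ms each this
 * adds up to almost a minute inside the safepoint. */
static void scan_map_slow() {
  int fd = open("/proc/self/map", O_RDONLY);
  if (fd < 0) return;
  prmap_t p;
  while (read(fd, &p, sizeof(p)) == sizeof(p)) {
    /* inspect p.pr_vaddr, p.pr_size, ... */
  }
  close(fd);   /* the 7u76 bug: this close() sat inside the loop */
}

/* Fast pattern, as pmap does it: fstat() the file, then fetch all
 * entries with a single pread(). */
static void scan_map_fast() {
  int fd = open("/proc/self/map", O_RDONLY);
  if (fd < 0) return;
  struct stat st;
  if (fstat(fd, &st) == 0 && st.st_size > 0) {
    size_t nentries = st.st_size / sizeof(prmap_t);
    prmap_t *maps = (prmap_t *) malloc(st.st_size);
    if (maps != NULL && pread(fd, maps, st.st_size, 0) == st.st_size) {
      for (size_t i = 0; i < nentries; i++) {
        /* inspect maps[i].pr_vaddr, maps[i].pr_size, ... */
      }
    }
    free(maps);
  }
  close(fd);
}

(The real check_addr0() presumably just looks for a mapping at
address 0, as the name and bug 6478739 suggest; the loop shape is
what matters here.)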

> What's the right mailing list to report this HotSpot issue to?
>
> Regards,
>
> Rainer
>
>> On Tue, Nov 10, 2015 at 12:47 PM, Rainer Jung
>> <rainer.jung at kippdata.de> wrote:
>>> The pause is due to the call "(void) check_addr0(st)" in
>>> os::print_memory_info().
>>>
>>> The call reads "/proc/self/map". In our case it has, for example,
>>> 1400 entries, and each read takes about 40 ms.
>>>
>>> The same function check_addr0() is also used in
>>> os::run_periodic_checks().
>>> Not sure why it is also done directly after each GC log rotation.
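
The arithmetic fits the observed pauses: roughly 1400 map entries at
about 40 ms per read() means a single check_addr0() call costs
1400 × 0.04 s ≈ 56 s, which matches the 55.8 second stop logged right
after the rotation in the log excerpt below. (My assumption on the
rotation connection: each newly rotated file gets the same VM header
written into it - version, memory info, command line flags, as visible
below - and printing the memory info is what ends up in check_addr0();
I have not verified that code path in detail.)
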
>>>
>>> Regards,
>>>
>>> Rainer
>>>
>>>
>>> On 10.11.2015 at 10:13, Rainer Jung wrote:
>>>>
>>>> Addition: the longest pause we experienced was more than 2400
>>>> seconds ...
>>>>
>>>> And: the platform is Solaris SPARC (T4), but we don't know whether
>>>> the issue is platform-dependent.
>>>>
>>>> It also happens on test systems, so I'll write a script that calls
>>>> pstack when a pause is detected, to find out what the threads are
>>>> doing or where they are hanging.
>>>>
>>>> Regards,
>>>>
>>>> Rainer
>>>>
>>>> On 09.11.2015 at 21:23, Rainer Jung wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> after upgrading from 1.7.0_76 to 1.7.0_80 we experience long pauses
>>>>> directly after a GC log rotation.
>>>>>
>>>>> The pause duration varies with application and load but is in the
>>>>> range of 6 to 60 seconds. There is no GC involved, i.e. no GC
>>>>> output is written related to these pauses.
>>>>>
>>>>> Example:
>>>>>
>>>>> Previous file ends with:
>>>>>
>>>>> 2015-11-09T01:28:36.832+0100: 38461,486: Application time: 5,2840810
>>>>> seconds
>>>>> {Heap before GC invocations=7366 (full 8):
>>>>> par new generation   total 458752K, used 442678K [0xfffffffe00400000,
>>>>> 0xfffffffe20400000, 0xfffffffe20400000)
>>>>>     eden space 393216K, 100% used [0xfffffffe00400000,
>>>>> 0xfffffffe18400000, 0xfffffffe18400000)
>>>>>     from space 65536K,  75% used [0xfffffffe18400000,
>>>>> 0xfffffffe1b44d998,
>>>>> 0xfffffffe1c400000)
>>>>>     to   space 65536K,   0% used [0xfffffffe1c400000,
>>>>> 0xfffffffe1c400000,
>>>>> 0xfffffffe20400000)
>>>>> concurrent mark-sweep generation total 3670016K, used 1887085K
>>>>> [0xfffffffe20400000, 0xffffffff00400000, 0xffffffff00400000)
>>>>> concurrent-mark-sweep perm gen total 524288K, used 453862K
>>>>> [0xffffffff00400000, 0xffffffff20400000, 0xffffffff20400000)
>>>>> 2015-11-09T01:28:36.839+0100: 38461,493:
>>>>> [GC2015-11-09T01:28:36.840+0100: 38461,493: [ParNew
>>>>> Desired survivor size 33554432 bytes, new threshold 16 (max 31)
>>>>> - age   1:    2964800 bytes,    2964800 total
>>>>> - age   2:    2628048 bytes,    5592848 total
>>>>> - age   3:    1415792 bytes,    7008640 total
>>>>> - age   4:    1354008 bytes,    8362648 total
>>>>> - age   5:    1132056 bytes,    9494704 total
>>>>> - age   6:    1334072 bytes,   10828776 total
>>>>> - age   7:    1407336 bytes,   12236112 total
>>>>> - age   8:    3321304 bytes,   15557416 total
>>>>> - age   9:    1531064 bytes,   17088480 total
>>>>> - age  10:    2453024 bytes,   19541504 total
>>>>> - age  11:    2797616 bytes,   22339120 total
>>>>> - age  12:    1698584 bytes,   24037704 total
>>>>> - age  13:    1870064 bytes,   25907768 total
>>>>> - age  14:    2211528 bytes,   28119296 total
>>>>> - age  15:    3626888 bytes,   31746184 total
>>>>> : 442678K->37742K(458752K), 0,0802687 secs]
>>>>> 2329763K->1924827K(4128768K), 0,0812120 secs] [Times: user=0,90
>>>>> sys=0,03, real=0,08 secs]
>>>>> Heap after GC invocations=7367 (full 8):
>>>>> par new generation   total 458752K, used 37742K [0xfffffffe00400000,
>>>>> 0xfffffffe20400000, 0xfffffffe20400000)
>>>>>     eden space 393216K,   0% used [0xfffffffe00400000,
>>>>> 0xfffffffe00400000, 0xfffffffe18400000)
>>>>>     from space 65536K,  57% used [0xfffffffe1c400000,
>>>>> 0xfffffffe1e8db9a0,
>>>>> 0xfffffffe20400000)
>>>>>     to   space 65536K,   0% used [0xfffffffe18400000,
>>>>> 0xfffffffe18400000,
>>>>> 0xfffffffe1c400000)
>>>>> concurrent mark-sweep generation total 3670016K, used 1887085K
>>>>> [0xfffffffe20400000, 0xffffffff00400000, 0xffffffff00400000)
>>>>> concurrent-mark-sweep perm gen total 524288K, used 453862K
>>>>> [0xffffffff00400000, 0xffffffff20400000, 0xffffffff20400000)
>>>>> }
>>>>> ....
>>>>> 2015-11-09T01:28:36.921+0100: 38461,575: Total time for which
>>>>> application threads were stopped: 0,0888232 seconds, Stopping threads
>>>>> took: 0,0005420 seconds
>>>>> 2015-11-09T01:28:59.821+0100: 38484,474: Application time: 0,0002954
>>>>> seconds
>>>>> 2015-11-09T01:28:59.823+0100: 38484,477: Total time for which
>>>>> application threads were stopped: 0,0026081 seconds, Stopping threads
>>>>> took: 0,0004146 seconds
>>>>> 2015-11-09T01:28:59.824+0100: 38484,477: Application time: 0,0003073
>>>>> seconds
>>>>> 2015-11-09T01:28:59.826+0100: 38484,480: Total time for which
>>>>> application threads were stopped: 0,0025411 seconds, Stopping threads
>>>>> took: 0,0004064 seconds
>>>>> 2015-11-09T01:28:59.827+0100: 38484,480: Application time: 0,0002885
>>>>> seconds
>>>>> 2015-11-09 01:28:59 GC log file has reached the maximum size. Saved as
>>>>> ./application/logs-a/mkb_gc.log.2
>>>>>
>>>>> This output looks normal. The last timestamp is 2015-11-09T01:28:59.827.
>>>>>
>>>>> Now the next file begins:
>>>>>
>>>>> 2015-11-09 01:28:59 GC log file created
>>>>> ./application/logs-a/mkb_gc.log.3
>>>>> Java HotSpot(TM) 64-Bit Server VM (24.80-b11) for solaris-sparc JRE
>>>>> (1.7.0_80-b15), built on Apr 10 2015 18:47:18 by "" with Sun Studio
>>>>> 12u1
>>>>> Memory: 8k page, physical 133693440k(14956840k free)
>>>>> CommandLine flags: -XX:AllocateInstancePrefetchLines=2
>>>>> -XX:AllocatePrefetchInstr=1 -XX:AllocatePrefetchLines=6
>>>>> -XX:AllocatePrefetchStyle=3 -XX:+CMSClassUnloadingEnabled
>>>>> -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses
>>>>> -XX:GCLogFileSize=10485760 -XX:InitialHeapSize=4294967296
>>>>> -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=536870912
>>>>> -XX:MaxPermSize=536870912 -XX:MaxTenuringThreshold=31
>>>>> -XX:NewSize=536870912 -XX:NumberOfGCLogFiles=10 -XX:OldPLABSize=16
>>>>> -XX:ParallelGCThreads=16 -XX:PermSize=536870912 -XX:+PrintGC
>>>>> -XX:+PrintGCApplicationConcurrentTime
>>>>> -XX:+PrintGCApplicationStoppedTime
>>>>> -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>>>>> -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:SurvivorRatio=6
>>>>> -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops
>>>>> -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC
>>>>> 2015-11-09T01:29:55.640+0100: 38540,292: Total time for which
>>>>> application threads were stopped: 55,8119519 seconds, Stopping threads
>>>>> took: 0,0003857 seconds
>>>>> 2015-11-09T01:29:55.648+0100: 38540,299: Application time: 0,0076173
>>>>> seconds
>>>>>
>>>>> Note the 55.8 second pause directly after the flags are printed,
>>>>> and the matching timestamp jump from 01:28:59 to 01:29:55. There's
>>>>> no GC output, although verbose GC is active and works. For some
>>>>> other reason there is a very long safepoint. Note also that the
>>>>> time is not spent waiting until the safepoint is reached: the log
>>>>> claims that stopping the threads took only 0,0003857 seconds. Also,
>>>>> at that time of day the servers are not very busy.
>>>>>
>>>>> Does anyone have an idea what happens here? Does anything ring a
>>>>> bell between 1.7.0_76 and 1.7.0_80? Why should there be a long
>>>>> safepoint directly after the GC log rotation opened a new file?
>>>>>
>>>>> I searched the bug parade but didn't find a good hit. There was
>>>>> also nothing in the change for JDK-7164841 that seemed immediately
>>>>> responsible for a long pause.
>>>>>
>>>>> Unfortunately this happens on a production system, and the first
>>>>> thing we did was roll back to the old Java version. Not sure how
>>>>> well this will be reproducible on a test system (I will check
>>>>> tomorrow).
>>>>>
>>>>> Thanks for any hint,
>>>>>
>>>>> Rainer

