From rainer.jung at kippdata.de Mon Nov 9 20:23:25 2015
From: rainer.jung at kippdata.de (Rainer Jung)
Date: Mon, 9 Nov 2015 21:23:25 +0100
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
Message-ID: <564100BD.1070002@kippdata.de>

Hi,

after upgrading from 1.7.0_76 to 1.7.0_80 we experience long pauses directly after a GC log rotation.

The pause duration varies due to application and load but is in the range of 6 seconds to 60 seconds. There is no GC involved, i.e. no GC output is written related to these pauses.

Example:

Previous file ends with:

2015-11-09T01:28:36.832+0100: 38461,486: Application time: 5,2840810 seconds
{Heap before GC invocations=7366 (full 8):
 par new generation total 458752K, used 442678K [0xfffffffe00400000, 0xfffffffe20400000, 0xfffffffe20400000)
  eden space 393216K, 100% used [0xfffffffe00400000, 0xfffffffe18400000, 0xfffffffe18400000)
  from space 65536K, 75% used [0xfffffffe18400000, 0xfffffffe1b44d998, 0xfffffffe1c400000)
  to space 65536K, 0% used [0xfffffffe1c400000, 0xfffffffe1c400000, 0xfffffffe20400000)
 concurrent mark-sweep generation total 3670016K, used 1887085K [0xfffffffe20400000, 0xffffffff00400000, 0xffffffff00400000)
 concurrent-mark-sweep perm gen total 524288K, used 453862K [0xffffffff00400000, 0xffffffff20400000, 0xffffffff20400000)
2015-11-09T01:28:36.839+0100: 38461,493: [GC2015-11-09T01:28:36.840+0100: 38461,493: [ParNew
Desired survivor size 33554432 bytes, new threshold 16 (max 31)
- age 1: 2964800 bytes, 2964800 total
- age 2: 2628048 bytes, 5592848 total
- age 3: 1415792 bytes, 7008640 total
- age 4: 1354008 bytes, 8362648 total
- age 5: 1132056 bytes, 9494704 total
- age 6: 1334072 bytes, 10828776 total
- age 7: 1407336 bytes, 12236112 total
- age 8: 3321304 bytes, 15557416 total
- age 9: 1531064 bytes, 17088480 total
- age 10: 2453024 bytes, 19541504 total
- age 11: 2797616 bytes, 22339120 total
- age 12: 1698584 bytes, 24037704 total
- age 13: 1870064 bytes, 25907768 total
- age 14: 2211528 bytes, 28119296 total
- age 15: 3626888 bytes, 31746184 total
: 442678K->37742K(458752K), 0,0802687 secs] 2329763K->1924827K(4128768K), 0,0812120 secs] [Times: user=0,90 sys=0,03, real=0,08 secs]
Heap after GC invocations=7367 (full 8):
 par new generation total 458752K, used 37742K [0xfffffffe00400000, 0xfffffffe20400000, 0xfffffffe20400000)
  eden space 393216K, 0% used [0xfffffffe00400000, 0xfffffffe00400000, 0xfffffffe18400000)
  from space 65536K, 57% used [0xfffffffe1c400000, 0xfffffffe1e8db9a0, 0xfffffffe20400000)
  to space 65536K, 0% used [0xfffffffe18400000, 0xfffffffe18400000, 0xfffffffe1c400000)
 concurrent mark-sweep generation total 3670016K, used 1887085K [0xfffffffe20400000, 0xffffffff00400000, 0xffffffff00400000)
 concurrent-mark-sweep perm gen total 524288K, used 453862K [0xffffffff00400000, 0xffffffff20400000, 0xffffffff20400000)
}
....
2015-11-09T01:28:36.921+0100: 38461,575: Total time for which application threads were stopped: 0,0888232 seconds, Stopping threads took: 0,0005420 seconds
2015-11-09T01:28:59.821+0100: 38484,474: Application time: 0,0002954 seconds
2015-11-09T01:28:59.823+0100: 38484,477: Total time for which application threads were stopped: 0,0026081 seconds, Stopping threads took: 0,0004146 seconds
2015-11-09T01:28:59.824+0100: 38484,477: Application time: 0,0003073 seconds
2015-11-09T01:28:59.826+0100: 38484,480: Total time for which application threads were stopped: 0,0025411 seconds, Stopping threads took: 0,0004064 seconds
2015-11-09T01:28:59.827+0100: 38484,480: Application time: 0,0002885 seconds
2015-11-09 01:28:59 GC log file has reached the maximum size. Saved as ./application/logs-a/mkb_gc.log.2

This output looks normal. Last timestamp is 2015-11-09T01:28:59.827

Now the next file begins:

2015-11-09 01:28:59 GC log file created ./application/logs-a/mkb_gc.log.3
Java HotSpot(TM) 64-Bit Server VM (24.80-b11) for solaris-sparc JRE (1.7.0_80-b15), built on Apr 10 2015 18:47:18 by "" with Sun Studio 12u1
Memory: 8k page, physical 133693440k(14956840k free)
CommandLine flags: -XX:AllocateInstancePrefetchLines=2 -XX:AllocatePrefetchInstr=1 -XX:AllocatePrefetchLines=6 -XX:AllocatePrefetchStyle=3 -XX:+CMSClassUnloadingEnabled -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses -XX:GCLogFileSize=10485760 -XX:InitialHeapSize=4294967296 -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=536870912 -XX:MaxPermSize=536870912 -XX:MaxTenuringThreshold=31 -XX:NewSize=536870912 -XX:NumberOfGCLogFiles=10 -XX:OldPLABSize=16 -XX:ParallelGCThreads=16 -XX:PermSize=536870912 -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:SurvivorRatio=6 -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC
2015-11-09T01:29:55.640+0100: 38540,292: Total time for which application threads were stopped: 55,8119519 seconds, Stopping threads took: 0,0003857 seconds
2015-11-09T01:29:55.648+0100: 38540,299: Application time: 0,0076173 seconds

Note the 55.8 seconds pause directly after printing the flags and the consistent timestamp jump from 01:28:59 to 01:29:55. There's no GC output, although verbose GC is active and works. For some other reason there is a very long safepoint. Note also that the time is not due to waiting until the safepoint is reached; at least the log claims that reaching the safepoint only took 0.008 seconds. Also, at that time of day the servers are not very busy.

Does anyone have an idea what happens here? Anything that rings a bell between 1.7.0_76 and 1.7.0_80? Why should there be a long safepoint directly after GC log rotation opened a new file?

I searched the bug parade, but didn't find a good hit. There was also nothing in the change for JDK-7164841 that seemed immediately responsible for a long pause.

Unfortunately this happens on a production system and the first thing was to roll back to the old Java version. Not sure how well this will be reproducible on a test system (will check tomorrow).
Thanks for any hint,

Rainer

From rainer.jung at kippdata.de Tue Nov 10 09:13:35 2015
From: rainer.jung at kippdata.de (Rainer Jung)
Date: Tue, 10 Nov 2015 10:13:35 +0100
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
In-Reply-To: <564100BD.1070002@kippdata.de>
References: <564100BD.1070002@kippdata.de>
Message-ID: <5641B53F.2000709@kippdata.de>

Addition: the longest pause that was experienced was more than 2400 seconds ...

And: platform is Solaris Sparc (T4). But we don't know whether it is platform dependent.

It also happens on test systems, so I'll write a script that calls pstack when the problem is detected, to find out what the threads are doing or where they are hanging.

Regards,

Rainer

Am 09.11.2015 um 21:23 schrieb Rainer Jung:
> [...]
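On a test system, rotation itself can be provoked much more often by shrinking the rotated file size, which makes the pause easy to catch with such a pstack script. The following command line is only a sketch: the heap and file sizes are arbitrary, the main class is a placeholder for any allocation-heavy test program, and the flags are the standard JDK 7 GC-logging options already visible in the flag dump above.

  java -Xms512m -Xmx512m \
       -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -XX:+PrintGCApplicationStoppedTime \
       -Xloggc:./gc.log \
       -XX:+UseGCLogFileRotation \
       -XX:NumberOfGCLogFiles=10 \
       -XX:GCLogFileSize=64k \
       SomeAllocatingTestApp

Every "GC log file created" line in the new file is then followed by the VM version and flag dump; if the problem reproduces, an unusually large "Total time for which application threads were stopped" entry should appear right after it, and a pstack of the process taken during that window should show where the VM thread is spending the time.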
From rainer.jung at kippdata.de Tue Nov 10 12:47:34 2015
From: rainer.jung at kippdata.de (Rainer Jung)
Date: Tue, 10 Nov 2015 13:47:34 +0100
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
In-Reply-To: <5641B53F.2000709@kippdata.de>
References: <564100BD.1070002@kippdata.de> <5641B53F.2000709@kippdata.de>
Message-ID: <5641E766.9050306@kippdata.de>

The pause is due to the call "(void) check_addr0(st)" in os::print_memory_info().

The call reads "/proc/self/map". In our case it has for instance 1400 entries, and each read takes about 40 ms.

The same function check_addr0() is also used in os::run_periodic_checks(). Not sure why it is also done directly after each GC log rotation.

Regards,

Rainer

Am 10.11.2015 um 10:13 schrieb Rainer Jung:
> [...]
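The access pattern described here can be reproduced outside the JVM with a few lines of C. The following is only an illustrative sketch, not the HotSpot source: it walks /proc/self/map the same way, one read() per prmap_t entry, so a process with ~1400 mappings pays ~1400 system calls; timing it with ptime against the map file of a large process makes the cost visible. It assumes the Solaris proc(4) interface from <procfs.h>.

  /* Solaris only: read /proc/self/map one prmap_t entry per read().
   * Illustrates the per-entry access pattern described above. */
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <procfs.h>

  int main(void)
  {
      prmap_t entry;
      int entries = 0;
      int fd = open("/proc/self/map", O_RDONLY);

      if (fd < 0) {
          perror("open /proc/self/map");
          return 1;
      }
      /* One system call per mapping; with many mappings and a slow
       * per-read cost this is where the time goes. */
      while (read(fd, &entry, sizeof(entry)) == (ssize_t)sizeof(entry)) {
          if (entry.pr_vaddr == 0) {
              printf("mapping at address 0, size %luK\n",
                     (unsigned long)(entry.pr_size / 1024));
          }
          entries++;
      }
      close(fd);
      printf("%d map entries read one by one\n", entries);
      return 0;
  }

Pointing the same loop at /proc/<pid>/map of the running Java process, as Rainer did with the extracted check_addr0() code, reproduces the multi-second runtime.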
From rainer.jung at kippdata.de Tue Nov 10 13:24:02 2015
From: rainer.jung at kippdata.de (Rainer Jung)
Date: Tue, 10 Nov 2015 14:24:02 +0100
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
In-Reply-To: <5641E766.9050306@kippdata.de>
References: <564100BD.1070002@kippdata.de> <5641B53F.2000709@kippdata.de> <5641E766.9050306@kippdata.de>
Message-ID: <5641EFF2.4020300@kippdata.de>

Am 10.11.2015 um 13:47 schrieb Rainer Jung:
> The pause is due to the call "(void) check_addr0(st)" in os::print_memory_info().
>
> The call reads "/proc/self/map". In our case it has for instance 1400 entries, and each read takes about 40 ms.
>
> The same function check_addr0() is also used in os::run_periodic_checks(). Not sure why it is also done directly after each GC log rotation.

Note that the Solaris command pmap reads the same file in one go using the pread() call instead of reading lots of entries one by one. Therefore pmap executes very quickly, but check_addr0() does not.

If I copy the check_addr0() code into a separate executable and run it from there, reading the map file of the Java process, I can reproduce the long runtime. So it seems this is unfortunately an example of bad (very inefficient) programming.

The only thing I could not yet analyze is what the difference from 1.7.0_76 is, where the problem doesn't happen. Work in progress.

It looks like we should open a bug?

Regards,

Rainer

Am 10.11.2015 um 10:13 schrieb Rainer Jung:
> [...]
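For comparison, here is a sketch of the bulk approach: the same walk over /proc/self/map, but fetching many prmap_t entries per system call. pmap goes further and, as Rainer notes, reads the whole file in one go with pread(); even the simple batching below already collapses ~1400 reads into a handful. Again this is an illustration under the proc(4) assumptions above, not the pmap or HotSpot source.

  /* Solaris only: read /proc/self/map in large batches instead of
   * one prmap_t per system call. */
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <procfs.h>

  #define BATCH 256                 /* entries fetched per read() */

  int main(void)
  {
      prmap_t batch[BATCH];
      ssize_t n;
      int i, entries = 0, syscalls = 0;
      int fd = open("/proc/self/map", O_RDONLY);

      if (fd < 0) {
          perror("open /proc/self/map");
          return 1;
      }
      while ((n = read(fd, batch, sizeof(batch))) > 0) {
          syscalls++;
          for (i = 0; i < (int)(n / sizeof(prmap_t)); i++) {
              if (batch[i].pr_vaddr == 0) {
                  printf("mapping at address 0, size %luK\n",
                         (unsigned long)(batch[i].pr_size / 1024));
              }
          }
          entries += (int)(n / sizeof(prmap_t));
      }
      close(fd);
      printf("%d entries in %d read() calls\n", entries, syscalls);
      return 0;
  }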
From ceeaspb at gmail.com Tue Nov 10 13:31:49 2015
From: ceeaspb at gmail.com (Alex Bagehot)
Date: Tue, 10 Nov 2015 13:31:49 +0000
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
In-Reply-To: <5641E766.9050306@kippdata.de>
References: <564100BD.1070002@kippdata.de> <5641B53F.2000709@kippdata.de> <5641E766.9050306@kippdata.de>
Message-ID:

http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6478739 relates (& states it is only for solaris)

On Tue, Nov 10, 2015 at 12:47 PM, Rainer Jung wrote:
> [...]
From rainer.jung at kippdata.de Tue Nov 10 14:53:09 2015
From: rainer.jung at kippdata.de (Rainer Jung)
Date: Tue, 10 Nov 2015 15:53:09 +0100
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
In-Reply-To:
References: <564100BD.1070002@kippdata.de> <5641B53F.2000709@kippdata.de> <5641E766.9050306@kippdata.de>
Message-ID: <564204D5.8080406@kippdata.de>

Am 10.11.2015 um 14:31 schrieb Alex Bagehot:
> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6478739 relates (& states it is only for solaris)

Thanks for the explanation. I now found why the problem happens on 80 but not on 76.

Both versions contain the same very inefficient code. But in 76 there was a bug that closed the map file after reading the first entry, because the close() was done inside the read loop instead of after the read loop. That bug was fixed in 80, so that now actually the whole file is read and not only the first item. Only then does the performance bug exhibit itself.

The responsible commit is

http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/diff/e50eb3195734/src/os/solaris/vm/os_solaris.cpp

But the real bug is reading the map file item by item instead of using pread() to read it all at once like pmap does.

I expect the problem to be also in 1.8.0 and 1.9.0 (to be checked).

What's the right list to tell about this hotspot issue?

Regards,

Rainer

> On Tue, Nov 10, 2015 at 12:47 PM, Rainer Jung wrote:
>> [...]
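The effect of the 7u76 bug described here is easy to see in a toy version of the same loop. The sketch below is purely illustrative and is not the code from the linked changeset: closing the descriptor inside the loop body makes the next read() fail, so the loop ends after the first entry, which is why the slow per-entry scan stayed hidden before 7u80.

  /* Solaris only, illustration of the control-flow difference:
   * close() inside the read loop stops after one entry (7u76-like),
   * close() after the loop walks all entries (7u80-like). */
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <procfs.h>

  static int count_entries(int close_inside_loop)
  {
      prmap_t entry;
      int entries = 0;
      int fd = open("/proc/self/map", O_RDONLY);

      if (fd < 0)
          return -1;
      while (read(fd, &entry, sizeof(entry)) == (ssize_t)sizeof(entry)) {
          entries++;
          if (close_inside_loop)
              close(fd);   /* next read() fails, loop ends after 1 entry */
      }
      if (!close_inside_loop)
          close(fd);
      return entries;
  }

  int main(void)
  {
      printf("close inside loop: %d entries\n", count_entries(1));
      printf("close after loop : %d entries\n", count_entries(0));
      return 0;
  }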
From rainer.jung at kippdata.de Tue Nov 10 15:21:19 2015
From: rainer.jung at kippdata.de (Rainer Jung)
Date: Tue, 10 Nov 2015 16:21:19 +0100
Subject: Long safepoint pause directly after GC log file rotation in 1.7.0_80
In-Reply-To: <564204D5.8080406@kippdata.de>
References: <564100BD.1070002@kippdata.de> <5641B53F.2000709@kippdata.de> <5641E766.9050306@kippdata.de> <564204D5.8080406@kippdata.de>
Message-ID: <56420B6F.5090908@kippdata.de>

Am 10.11.2015 um 15:53 schrieb Rainer Jung:
> [...]
> I expect the problem to be also in 1.8.0 and 1.9.0 (to be checked).

Yes, the same problem is in current 1.8.0 and 1.9.0. For 1.8.0 it was introduced in 1.8.0_20.

> What's the right list to tell about this hotspot issue?
>
> Regards,
>
> Rainer
>
> [...]
Saved as >>>>> ./application/logs-a/mkb_gc.log.2 >>>>> >>>>> This output looks normal. Last timestamp is 2015-11-09T01:28:59.827 >>>>> >>>>> Now the next file begins: >>>>> >>>>> 2015-11-09 01:28:59 GC log file created >>>>> ./application/logs-a/mkb_gc.log.3 >>>>> Java HotSpot(TM) 64-Bit Server VM (24.80-b11) for solaris-sparc JRE >>>>> (1.7.0_80-b15), built on Apr 10 2015 18:47:18 by "" with Sun Studio >>>>> 12u1 >>>>> Memory: 8k page, physical 133693440k(14956840k free) >>>>> CommandLine flags: -XX:AllocateInstancePrefetchLines=2 >>>>> -XX:AllocatePrefetchInstr=1 -XX:AllocatePrefetchLines=6 >>>>> -XX:AllocatePrefetchStyle=3 -XX:+CMSClassUnloadingEnabled >>>>> -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses >>>>> -XX:GCLogFileSize=10485760 -XX:InitialHeapSize=4294967296 >>>>> -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=536870912 >>>>> -XX:MaxPermSize=536870912 -XX:MaxTenuringThreshold=31 >>>>> -XX:NewSize=536870912 -XX:NumberOfGCLogFiles=10 -XX:OldPLABSize=16 >>>>> -XX:ParallelGCThreads=16 -XX:PermSize=536870912 -XX:+PrintGC >>>>> -XX:+PrintGCApplicationConcurrentTime >>>>> -XX:+PrintGCApplicationStoppedTime >>>>> -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps >>>>> -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:SurvivorRatio=6 >>>>> -XX:-UseAdaptiveSizePolicy -XX:+UseCompressedOops >>>>> -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC >>>>> 2015-11-09T01:29:55.640+0100: 38540,292: Total time for which >>>>> application threads were stopped: 55,8119519 seconds, Stopping threads >>>>> took: 0,0003857 seconds >>>>> 2015-11-09T01:29:55.648+0100: 38540,299: Application time: 0,0076173 >>>>> seconds >>>>> >>>>> Note the 55.8 seconds pause directly after printing the flags and the >>>>> consistent timestamp jump from 01:28:59 to 01:29:55. There's no GC >>>>> output, although verbose GC is active and works. For some other reason >>>>> there is a very long safepoint. Note also, that the time is not due to >>>>> waiting until the safepoint is reached. At least the log claims that >>>>> reaching the safepoint only took 0.008 seconds. Also at that timeof >>>>> day >>>>> the servers are not very busy. >>>>> >>>>> Is there any idea, what happens here? Anything that rings a bell >>>>> between >>>>> 1.7.0_76 and 1.7.0_80? Why should there be a long safepoint directly >>>>> after GC rotation opened a new file? >>>>> >>>>> I searched the bug parade, but didn't find a good hit. There was also >>>>> nothing in the change for JDK-7164841 that seemed immediately >>>>> responsible for a long pause. >>>>> >>>>> Unfortunately this happens on a production system and the first thing >>>>> was to roll back to the old Java version.Not sure, how good this >>>>> will be >>>>> reproducible on a test system (will check tomorrow). >>>>> >>>>> Thanks for any hint, >>>>> >>>>> Rainer From jun.zhuang at hobsons.com Tue Nov 10 14:35:50 2015 From: jun.zhuang at hobsons.com (Jun Zhuang) Date: Tue, 10 Nov 2015 14:35:50 +0000 Subject: Seeking answer to a GC pattern In-Reply-To: <56411FC4.1030205@oracle.com> References: <56411FC4.1030205@oracle.com> Message-ID: Hi Yu, Yes I have disabled the GC ergonomics after experimenting with many java startup parameter combinations. For all the GC algorithms I have tried, the GC behavior was more or less the same but some were definitely worse than others. Main problems I have are: * The saw-tooth pattern response time as measured by my load testing tool with or without -XX:+AlwaysTenure. 
Right before a full collection, the response time could reach a few seconds depending on the size of the young gen. The bigger the young gen the higher the highs of the response time. * Objects for this application seem to have a long life * Using the parameters I had in the previous email seems to give me the best performance so far. A few examples of the gc log: -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxTenuringThreshold=10 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC 31265.959: [GC 31265.959: [ParNew Desired survivor size 53673984 bytes, new threshold 10 (max 10) - age 1: 1695168 bytes, 1695168 total - age 2: 825984 bytes, 2521152 total - age 3: 823424 bytes, 3344576 total - age 4: 770776 bytes, 4115352 total - age 5: 822064 bytes, 4937416 total - age 6: 816984 bytes, 5754400 total - age 7: 777064 bytes, 6531464 total - age 8: 850344 bytes, 7381808 total - age 9: 836016 bytes, 8217824 total - age 10: 810968 bytes, 9028792 total : 853704K->10926K(943744K), 1.4887200 secs] 1602653K->760749K(4089472K), 1.4889740 secs] [Times: user=5.46 sys=0.04, real=1.49 secs] 31365.452: [GC 31365.453: [ParNew Desired survivor size 53673984 bytes, new threshold 10 (max 10) - age 1: 1925120 bytes, 1925120 total - age 2: 818208 bytes, 2743328 total - age 3: 815128 bytes, 3558456 total - age 4: 819992 bytes, 4378448 total - age 5: 770648 bytes, 5149096 total - age 6: 822064 bytes, 5971160 total - age 7: 816984 bytes, 6788144 total - age 8: 777032 bytes, 7565176 total - age 9: 850344 bytes, 8415520 total - age 10: 836016 bytes, 9251536 total : 849838K->15550K(943744K), 1.4532880 secs] 1599661K->766188K(4089472K), 1.4535550 secs] [Times: user=5.38 sys=0.02, real=1.45 secs] ~~~~~~~~~~~~~~~~~ -server -XX:+UseCompressedOops -XX:+TieredCompilation -XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing -XX:+PrintTenuringDistribution -Xms2048m -Xmx4096m -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=126 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:+AlwaysTenure 2417.702: [GC [PSYoungGen: 129024K->0K(130048K)] 2078501K->1954674K(2096128K), 0.1765730 secs] [Times: user=0.64 sys=0.01, real=0.18 secs] 2421.355: [GC [PSYoungGen: 129007K->0K(130048K)] 2083681K->1957745K(2096128K), 0.1654380 secs] [Times: user=0.60 sys=0.01, real=0.17 secs] 2426.403: [GC [PSYoungGen: 129009K->0K(130048K)] 2086754K->1961487K(2096128K), 0.1711790 secs] [Times: user=0.62 sys=0.00, real=0.17 secs] 2426.575: [Full GC [PSYoungGen: 0K->0K(130048K)] [ParOldGen: 1961487K->353838K(1966080K)] 1961487K->353838K(2096128K) [PSPermGen: 103092K->100055K(203712K)], 1.1573820 secs] [Times: user=2.27 sys=0.14, real=1.16 secs] Thanks, Jun From: Yu Zhang [mailto:yu.zhang at oracle.com] Sent: Monday, November 09, 2015 5:36 PM To: Jun Zhuang ; hotspot-gc-use at openjdk.java.net Subject: Re: Seeking answer to a GC pattern Jun, Sorry for the late response. It seems you are disabling the gc ergonomic. Can you explain why? Do you need very low pause time? If you have a gc log, that would be helpful as well. Thanks, Jenny On 10/26/2015 12:33 PM, Jun Zhuang wrote: Hi, When running performance testing for a java web service running on JBOSS, I observed a clear saw-tooth pattern in CPU utilization that closely follows the GC cycles. 
see below: [cid:image001.jpg at 01D11B95.E81C2250] [cid:image002.jpg at 01D11B95.E81C2250] Java startup parameters used: -XX:+TieredCompilation -XX:+PrintTenuringDistribution -Xms2048m -Xmx4096m -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=126 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:+AlwaysTenure With this set of parameters, the young GC pause time ranged from 0.02 to 0.25 secs. When I used 256m for the young gen, the young GC pause time ranged from 0.02 to 0.5 secs. My understanding is that the young GC pause time normally stays fairly stable, I have spent quite some time researching but have yet to find an answer to this behavior. I wonder if people in this distribution list can help me out? Other related info * Server Specs: VM with 4 CPUs and 8 Gb mem * Test setup: * # of Vusers: 100 * Ramp up: 10 mins * Pacing: 5 - 7 secs * I tried with all other available GC algorithms, tenuring thresholds, various sizes of the generations, but the AlwaysTenure parameter seems to work the best so far. [cid:image003.jpg at 01D11B95.E81C2250] [cid:image004.jpg at 01D11B95.E81C2250] Any input will be highly appreciated. Sincerely yours, Jun Zhuang Sr. Performance QA Engineer | Hobsons 513-746-2288 (work) 513-227-7643 (mobile) _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 7083 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 11773 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 33498 bytes Desc: image003.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.jpg Type: image/jpeg Size: 19704 bytes Desc: image004.jpg URL: From yu.zhang at oracle.com Mon Nov 9 22:35:48 2015 From: yu.zhang at oracle.com (Yu Zhang) Date: Mon, 9 Nov 2015 14:35:48 -0800 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: <56411FC4.1030205@oracle.com> Jun, Sorry for the late response. It seems you are disabling the gc ergonomic. Can you explain why? Do you need very low pause time? If you have a gc log, that would be helpful as well. Thanks, Jenny On 10/26/2015 12:33 PM, Jun Zhuang wrote: > > Hi, > > When running performance testing for a java web service running on > JBOSS, I observed a clear saw-tooth pattern in CPU utilization that > closely follows the GC cycles. see below: > > Java startup parameters used: > > -XX:+TieredCompilation -XX:+PrintTenuringDistribution -Xms2048m > -Xmx4096m -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m > -XX:SurvivorRatio=126 -XX:-UseAdaptiveSizePolicy > -XX:+DisableExplicitGC -XX:+AlwaysTenure > > With this set of parameters, the young GC pause time ranged from 0.02 > to 0.25 secs. When I used 256m for the young gen, the young GC pause > time ranged from 0.02 to 0.5 secs. My understanding is that the young > GC pause time normally stays fairly stable, I have spent quite some > time researching but have yet to find an answer to this behavior. I > wonder if people in this distribution list can help me out? 
> > *_Other related info_* > > * Server Specs: VM with 4 CPUs and 8 Gb mem > > * Test setup: > > ?# of Vusers: 100 > > ?Ramp up: 10 mins > > ?Pacing: 5 ? 7 secs > > * I tried with all other available GC algorithms, tenuring thresholds, > various sizes of the generations, but the AlwaysTenure parameter seems > to work the best so far. > > Any input will be highly appreciated. > > Sincerely yours, > > *//* > > */Jun Zhuang/**//* > > /Sr. Performance QA Engineer | Hobsons/// > > /513-746-2288 (work)/// > > /513-227-7643 (mobile)/// > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 7083 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 11773 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 33498 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 19704 bytes Desc: not available URL: From thomas.schatzl at oracle.com Sat Nov 14 14:58:53 2015 From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Sat, 14 Nov 2015 15:58:53 +0100 Subject: Seeking answer to a GC pattern In-Reply-To: References: <56411FC4.1030205@oracle.com> Message-ID: <1447513133.2373.18.camel@oracle.com> Hi, > On 10/26/2015 12:33 PM, Jun Zhuang wrote: > Hi, > > When running performance testing for a java web service > running on JBOSS, I observed a clear saw-tooth pattern in CPU > utilization that closely follows the GC cycles. see below: > one issue that could cause this is nepotism (http://www.memorymanagement.org/glossary/n.html) of the objects that continously get promoted to the old gen, that is cyclically cleaned up by a kind of full gc. I.e. over time more and more actually dead objects get promoted to the old gen, but due to how generational gc works, they keep more and more objects in young gen alive. That would also explain why the problem is the same for any gc. The only real solution that you can do is make the application code null out references. Or make sure by proper young gen sizing that such pointers are not created, i.e. no objects that might keep alive many others in the young gen get promoted. That latter is a very brittle "solution" though. It may also just be the application though. One hint whether this is the fault of nepotism is take heap snapshots (e.g. jmap -histo:live should do fine), and look if the amount of live data stays roughly the same when taken at different times of that increase in heap memory (or at least does not increase similarly to the graphs). If it is not nepotism, using heap dumps and to some degree the histogram you may get information on what is keeping stuff alive. Then you know whether it the problem are really dead objects keeping stuff alive in young gen or not. Thanks, Thomas From jun.zhuang at hobsons.com Fri Nov 20 19:46:32 2015 From: jun.zhuang at hobsons.com (Jun Zhuang) Date: Fri, 20 Nov 2015 19:46:32 +0000 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: Hi Srinivas, Thanks for your suggestion. 
I ran a test with the following parameters:

-server -XX:+UseCompressedOops -XX:+TieredCompilation -XX:ReservedCodeCacheSize=64m -XX:+UseCodeCacheFlushing -XX:+PrintTenuringDistribution -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=6 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=2

But the -XX:MaxTenuringThreshold=2 setting does not seem to help anything. I am still seeing a similar GC pattern as with +AlwaysTenure; in fact the young GC time is higher with MTT=2 (getting to 0.5 secs vs. 0.25 with AlwaysTenure).

Unless anyone else can provide another theory, I am convinced that nepotism is at play here. Changing the Java startup parameters can only get me so far; dev will have to look at the code and see what can be done at the code level.

Thanks,
Jun

From: Srinivas Ramakrishna [mailto:ysr1729 at gmail.com]
Sent: Thursday, November 19, 2015 8:09 PM
To: Jun Zhuang
Cc: hotspot-gc-use at openjdk.java.net
Subject: Re: Seeking answer to a GC pattern

Use -XX:MaxTenuringThreshold=2 and you might see better behavior than +AlwaysTenure (which is almost always a very bad choice). That will at least reduce some of the nepotism issues from +AlwaysTenure that Thomas mentions. MTT > 2 is unlikely to help at your current frequency of minor collections since the mortality after age 1 is fairly low (from your tenuring distribution). Worth a quick test.

-- ramki
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From csulyj at gmail.com Sun Nov 22 00:00:53 2015
From: csulyj at gmail.com (yuanjun Li)
Date: Sun, 22 Nov 2015 08:00:53 +0800
Subject: frequent major gc but not free heap?
Message-ID:

After running for several hours, my HTTP server began doing frequent major GCs, but no heap was freed. Several major GCs later, a promotion failure and a concurrent mode failure occurred, and then the heap was freed.
My gc log is below : {Heap before GC invocations=7172 (full 720): par new generation total 737280K, used 667492K [0x000000076b800000, 0x000000079d800000, 0x000000079d800000) eden space 655360K, 100% used [0x000000076b800000, 0x0000000793800000, 0x0000000793800000) from space 81920K, 14% used [0x0000000793800000, 0x00000007943d91d0, 0x0000000798800000) to space 81920K, 0% used [0x0000000798800000, 0x0000000798800000, 0x000000079d800000) concurrent mark-sweep generation total 1482752K, used 1479471K [0x000000079d800000, 0x00000007f8000000, 0x00000007f8000000) concurrent-mark-sweep perm gen total 131072K, used 58091K [0x00000007f8000000, 0x0000000800000000, 0x0000000800000000)2015-11-19T21:50:02.692+0800: 113963.532: [GC2015-11-19T21:50:02.692+0800: 113963.532: [ParNew (promotion failed)Desired survivor size 41943040 bytes, new threshold 15 (max 15)- age 1: 3826144 bytes, 3826144 total- age 2: 305696 bytes, 4131840 total- age 3: 181416 bytes, 4313256 total- age 4: 940632 bytes, 5253888 total- age 5: 88368 bytes, 5342256 total- age 6: 159840 bytes, 5502096 total- age 7: 733856 bytes, 6235952 total- age 8: 64712 bytes, 6300664 total- age 9: 314304 bytes, 6614968 total- age 10: 587160 bytes, 7202128 total- age 11: 38728 bytes, 7240856 total- age 12: 221160 bytes, 7462016 total- age 13: 648376 bytes, 8110392 total- age 14: 33296 bytes, 8143688 total- age 15: 380768 bytes, 8524456 total: 667492K->665908K(737280K), 0.7665810 secs]2015-11-19T21:50:03.459+0800: 113964.299: [CMS2015-11-19T21:50:05.161+0800: 113966.001: [CMS-concurrent-mark: 3.579/4.747 secs] [Times: user=13.41 sys=0.35, rea l=4.75 secs] (concurrent mode failure): 1479910K->44010K(1482752K), 4.7267420 secs] 2146964K->44010K(2220032K), [CMS Perm : 58091K->57795K(131072K)], 5.4939440 secs] [Times: user=9.07 sys=0.13, real=5.49 secs] Heap after GC invocations=7173 (full 721): par new generation total 737280K, used 0K [0x000000076b800000, 0x000000079d800000, 0x000000079d800000) eden space 655360K, 0% used [0x000000076b800000, 0x000000076b800000, 0x0000000793800000) from space 81920K, 0% used [0x0000000798800000, 0x0000000798800000, 0x000000079d800000) to space 81920K, 0% used [0x0000000793800000, 0x0000000793800000, 0x0000000798800000) concurrent mark-sweep generation total 1482752K, used 44010K [0x000000079d800000, 0x00000007f8000000, 0x00000007f8000000) concurrent-mark-sweep perm gen total 131072K, used 57795K [0x00000007f8000000, 0x0000000800000000, 0x0000000800000000)} It seems the CMS GC doesn't make any sense. Could you please explain to me ? This is my gc config: -server -Xms2248m -Xmx2248m -Xmn800m -XX:PermSize=128m -XX:MaxPermSize=128m -XX:MaxTenuringThreshold=15 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+UseFastAccessorMethods -------------- next part -------------- An HTML attachment was scrubbed... URL: From ecki at zusammenkunft.net Mon Nov 23 22:41:25 2015 From: ecki at zusammenkunft.net (Bernd) Date: Mon, 23 Nov 2015 22:41:25 +0000 Subject: frequent major gc but not free heap? In-Reply-To: References: Message-ID: Hello, Can you provide a link to a full log file? >From your description it sounds like the memory was used, so a full GC cant free anything. 
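One quick way to tell whether the old generation is really holding live data (rather than garbage that simply has not been collected yet) is to look at its occupancy right after a stop-the-world collection. A rough, hypothetical self-monitoring sketch using the standard MemoryPoolMXBean API (the pool-name matching and the 60 second interval are just assumptions for illustration, not part of the configuration above):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class OldGenOccupancy {
    public static void main(String[] args) throws Exception {
        while (true) {
            // Request a full collection first; this is only meaningful if the
            // application does not run with -XX:+DisableExplicitGC.
            System.gc();
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // The pool is called "CMS Old Gen" with the collector used here;
                // other collectors name it "PS Old Gen", "G1 Old Gen", "Tenured Gen", etc.
                if (pool.getName().contains("Old") || pool.getName().contains("Tenured")) {
                    MemoryUsage u = pool.getUsage();
                    System.out.printf("%s: %d MB used of %d MB after GC%n",
                            pool.getName(), u.getUsed() >> 20, u.getMax() >> 20);
                }
            }
            Thread.sleep(60_000);
        }
    }
}

If the number stays close to the capacity even after the forced collection, the data is genuinely live and a heap dump is the next step. If it collapses, as it eventually did in the log above (1479910K->44010K), the old generation was mostly garbage and the question becomes why CMS did not finish its concurrent cycle before the allocation failure.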
This can either be, that it just take some more time until it was unused (requested/task ended or transaction timed out) or the additional tries triggered some soft references or classloaders to be freed. Hard to say without knowing your applications and having details. I would suggest you generate a heap dump next time the heap becomes full (if this is a slow event) or you try to correlate the memory problems with actual jobs or reqest types (if it happens quickly based on some workload) Bernd yuanjun Li schrieb am Mo., 23. Nov. 2015 23:33: > After running several hours, My http server begin frequenly major gc, but > no heap was freed. > > several times major gc later, promotion failed and concurrent mode failure occured, > then heap was freed. My gc log is below : > > {Heap before GC invocations=7172 (full 720): > par new generation total 737280K, used 667492K [0x000000076b800000, 0x000000079d800000, 0x000000079d800000) > eden space 655360K, 100% used [0x000000076b800000, 0x0000000793800000, 0x0000000793800000) > from space 81920K, 14% used [0x0000000793800000, 0x00000007943d91d0, 0x0000000798800000) > to space 81920K, 0% used [0x0000000798800000, 0x0000000798800000, 0x000000079d800000) > concurrent mark-sweep generation total 1482752K, used 1479471K [0x000000079d800000, 0x00000007f8000000, 0x00000007f8000000) > concurrent-mark-sweep perm gen total 131072K, used 58091K [0x00000007f8000000, 0x0000000800000000, 0x0000000800000000)2015-11-19T21:50:02.692+0800: 113963.532: [GC2015-11-19T21:50:02.692+0800: 113963.532: [ParNew (promotion failed)Desired survivor size 41943040 bytes, new threshold 15 (max 15)- age 1: 3826144 bytes, 3826144 total- age 2: 305696 bytes, 4131840 total- age 3: 181416 bytes, 4313256 total- age 4: 940632 bytes, 5253888 total- age 5: 88368 bytes, 5342256 total- age 6: 159840 bytes, 5502096 total- age 7: 733856 bytes, 6235952 total- age 8: 64712 bytes, 6300664 total- age 9: 314304 bytes, 6614968 total- age 10: 587160 bytes, 7202128 total- age 11: 38728 bytes, 7240856 total- age 12: 221160 bytes, 7462016 total- age 13: 648376 bytes, 8110392 total- age 14: 33296 bytes, 8143688 total- age 15: 380768 bytes, 8524456 total: 667492K->665908K(737280K), 0.7665810 secs]2015-11-19T21:50:03.459+0800: 113964.299: [CMS2015-11-19T21:50:05.161+0800: 113966.001: [CMS-concurrent-mark: 3.579/4.747 secs] [Times: user=13.41 sys=0.35, rea > l=4.75 secs] > (concurrent mode failure): 1479910K->44010K(1482752K), 4.7267420 secs] 2146964K->44010K(2220032K), [CMS Perm : 58091K->57795K(131072K)], 5.4939440 secs] [Times: user=9.07 sys=0.13, real=5.49 secs] Heap after GC invocations=7173 (full 721): > par new generation total 737280K, used 0K [0x000000076b800000, 0x000000079d800000, 0x000000079d800000) > eden space 655360K, 0% used [0x000000076b800000, 0x000000076b800000, 0x0000000793800000) > from space 81920K, 0% used [0x0000000798800000, 0x0000000798800000, 0x000000079d800000) > to space 81920K, 0% used [0x0000000793800000, 0x0000000793800000, 0x0000000798800000) > concurrent mark-sweep generation total 1482752K, used 44010K [0x000000079d800000, 0x00000007f8000000, 0x00000007f8000000) > concurrent-mark-sweep perm gen total 131072K, used 57795K [0x00000007f8000000, 0x0000000800000000, 0x0000000800000000)} > > It seems the CMS GC doesn't make any sense. Could you please explain to > me ? 
> > This is my gc config: > > -server -Xms2248m -Xmx2248m -Xmn800m -XX:PermSize=128m -XX:MaxPermSize=128m -XX:MaxTenuringThreshold=15 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+UseFastAccessorMethods > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Denny.Kettwig at werum.com Tue Nov 24 11:13:12 2015 From: Denny.Kettwig at werum.com (Denny Kettwig) Date: Tue, 24 Nov 2015 11:13:12 +0000 Subject: GCInterval in Java 8 Message-ID: <6175F8C4FE407D4F830EDA25C27A431701433E5194@Werum1790.werum.net> Hello, I have quick question regarding these two parameters: -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 We have set these parameters in the past to force a Full GC once every hour, however since we switched to Java 8 the parameter no longer has any effect. Has something changed in past? I can't find any source in the net mentioning a change in this area. Regards, Denny Kettwig Software Engineer [cid:image001.jpg at 01D126B1.7CCEF5A0] Werum IT Solutions GmbH Wulf-Werum-Str. 3, 21337 L?neburg, Germany T +49 4131 8900-983 F +49 4131 8900-20 denny.kettwig at werum.com www.werum.com Gesch?ftsf?hrer / Managing Directors: R?diger Schlierenk?mper, Richard Nagorny, Hans-Peter Subel RG L?neburg / Court of Jurisdiction: L?neburg, Germany Handelsregisternummer / Commercial Register: HRB 204984 USt.-IdNr. / VAT No.: DE 116 083 850 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 9408 bytes Desc: image001.jpg URL: From molendag at gmail.com Tue Nov 24 17:17:44 2015 From: molendag at gmail.com (Grzegorz Molenda) Date: Tue, 24 Nov 2015 18:17:44 +0100 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: Just a few tips: Check OS stats for paging / swapping activity at both VM'and hypervisor levels. Make sure the OS doesn't use transparent huge pages. If the above two don't help, try enabling -XX:+PrintGCTaskTimeStamps to diagnose, which part of GC collecion takes the most time. Note values aren't reported in time units, but in ticks. Subtract one from the other reported per task . Compare between tasks per signle collection and check stats from a few collections in row, to get the idea, where it does degradate. Thanks, Grzegorz 2015-11-20 20:46 GMT+01:00 Jun Zhuang : > Hi Srinivas, > > > > Thanks for your suggestion. I ran test with following parameters: > > > > -server -XX:+UseCompressedOops -XX:+TieredCompilation > -XX:ReservedCodeCacheSize=64m -XX:+UseCodeCacheFlushing > -XX:+PrintTenuringDistribution -Xms2g -Xmx2g -XX:MaxPermSize=256m > -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=6 > -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=2 > > > > But the -XX:MaxTenuringThreshold=2 setting does not seem to help anything. > I am still seeing similar GC pattern as with the +AlwaysTenure, actually > the young GC time is higher with MTT=2 (getting to 0.5 secs vs. 0.25 with > AlwaysTenure). 
> > > > Unless anyone else can provide another theory, I am convinced that > nepotism is at play here. Changing the java startup parameters can only get > me so far, dev will have to look at the code and see what can be done on > the code level. > > > > Thanks, > > Jun > > > > *From:* Srinivas Ramakrishna [mailto:ysr1729 at gmail.com] > *Sent:* Thursday, November 19, 2015 8:09 PM > *To:* Jun Zhuang > *Cc:* hotspot-gc-use at openjdk.java.net > *Subject:* Re: Seeking answer to a GC pattern > > > > Use -XX:MaxTenuringThreshold=2 and you might see better behavior that > +AlwaysTenure (which is almost always a very bad choice). That will at > least reduce some of the nepotism issues from +AlwaysTenure that Thomas > mentions. MTT > 2 is unlikely to help at your current frequency of minor > collections since the mortality after age 1 is fairly low (from your > tenuring distribution). Worth a quick test. > > > > -- ramki > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rainer.jung at kippdata.de Tue Nov 24 19:07:43 2015 From: rainer.jung at kippdata.de (Rainer Jung) Date: Tue, 24 Nov 2015 20:07:43 +0100 Subject: GCInterval in Java 8 In-Reply-To: <6175F8C4FE407D4F830EDA25C27A431701433E5194@Werum1790.werum.net> References: <6175F8C4FE407D4F830EDA25C27A431701433E5194@Werum1790.werum.net> Message-ID: <5654B57F.4030209@kippdata.de> Am 24.11.2015 um 12:13 schrieb Denny Kettwig: > Hello, > > I have quick question regarding these two parameters: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > > -Dsun.rmi.dgc.server.gcInterval=3600000 > > We have set these parameters in the past to force a Full GC once every > hour, however since we switched to Java 8 the parameter no longer has > any effect. Has something changed in past? I can?t find any source in > the net mentioning a change in this area. Originally the params resulted in a distributed GC exactly every 3600 seconds apart from each other. I vaguely remember that at some point in time it changed to running it if no other GC of Tenured had run since at least an hour and only then DGC kicked in. So as long as there are other reasons for normal GC of Tenured/OldGen at least once an hour you would no longer observe a DGC. Other sources might confirm this claim and know, in which version that was introduced. Could it be the reason for your observation, or do you have not other GCs of Tenured/OldGen as well? How do you check for GC events? Regards, Rainer From ysr1729 at gmail.com Tue Nov 24 20:13:42 2015 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Tue, 24 Nov 2015 12:13:42 -0800 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: What Grzegorz & Thomas said. Also you might take a heap dump before and after a full gc (-XX:+PrintClassHistogramBefore/AfterFullGC) to see the types that are reclaimed in the old gen. Might give you an idea as to the types of objects that got promoted and then later died, and hence whether avoidable nepotism is or is not a factor (and thence what you might null out to reduce such nepotism). Also, I guess what I meant was MTT=1. However, given that going from MTT=10 to MTT=2 didn't make any appreciable difference, MTT=2 to MTT=1 will not either. 
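To make the "null out" advice concrete, here is a minimal, hypothetical sketch of the nepotism pattern and its usual fix (illustrative code only, not taken from the application under discussion): a node that has already been promoted keeps its younger successor reachable for minor collections even after it has been removed from the structure, unless the link is cleared on removal. java.util.LinkedList clears item/next/prev in its unlink() for exactly this reason.

// Hypothetical unbounded FIFO queue, illustrating nepotism and the fix.
final class SimpleQueue {
    private static final class Node {
        Object payload;
        Node next;
    }

    private Node head, tail;

    void add(Object payload) {
        Node n = new Node();
        n.payload = payload;
        if (tail == null) {
            head = tail = n;
        } else {
            tail.next = n;
            tail = n;
        }
    }

    Object poll() {
        if (head == null) {
            return null;
        }
        Node n = head;
        head = n.next;
        if (head == null) {
            tail = null;
        }
        Object result = n.payload;
        // The fix: without these two lines, a dead node that was already
        // promoted to the old generation still references its (young)
        // successor and payload, so minor collections keep copying and
        // eventually promoting objects that are in fact unreachable.
        n.next = null;
        n.payload = null;
        return result;
    }
}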
You might also consider increasing yr young gen size, but that will likely also increase your pause times since objects tend to either die quickly or survive for a long time, and increasing the young gen size will still not age objects sufficiently to cause an increase in mortality. How many CPU's (and GC threads) do you have? Does the ratio of "real" to "usr+sys" increase as "real" ramps up? Does the amount that is promoted stay constant? That might imply that something is interfering with parallelization of copying. Typically that means that there is a long skinny data structure, such as a singly linked list that is being copied, although why that list would become longer (in terms of longer times) is not clear. Does the sawtooth of minor gc times happen even with MTT=1 or AlwaysTenure? (Hint: How many young collections do you see between the major collections when you see the sawtooth in young collection times? How does it compare with the highest age of object that is kept in the survivor spaces?) -- ramki On Tue, Nov 24, 2015 at 9:17 AM, Grzegorz Molenda wrote: > Just a few tips: > > Check OS stats for paging / swapping activity at both VM'and hypervisor > levels. > > Make sure the OS doesn't use transparent huge pages. > > If the above two don't help, try enabling -XX:+PrintGCTaskTimeStamps to > diagnose, which part of GC collecion takes the most time. Note values > aren't reported in time units, but in ticks. Subtract one from the other > reported per task . Compare between tasks per signle collection and check > stats from a few collections in row, to get the idea, where it does > degradate. > > Thanks, > > Grzegorz > > > > 2015-11-20 20:46 GMT+01:00 Jun Zhuang : > >> Hi Srinivas, >> >> >> >> Thanks for your suggestion. I ran test with following parameters: >> >> >> >> -server -XX:+UseCompressedOops -XX:+TieredCompilation >> -XX:ReservedCodeCacheSize=64m -XX:+UseCodeCacheFlushing >> -XX:+PrintTenuringDistribution -Xms2g -Xmx2g -XX:MaxPermSize=256m >> -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=6 >> -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=2 >> >> >> >> But the -XX:MaxTenuringThreshold=2 setting does not seem to help >> anything. I am still seeing similar GC pattern as with the +AlwaysTenure, >> actually the young GC time is higher with MTT=2 (getting to 0.5 secs vs. >> 0.25 with AlwaysTenure). >> >> >> >> Unless anyone else can provide another theory, I am convinced that >> nepotism is at play here. Changing the java startup parameters can only get >> me so far, dev will have to look at the code and see what can be done on >> the code level. >> >> >> >> Thanks, >> >> Jun >> >> >> >> *From:* Srinivas Ramakrishna [mailto:ysr1729 at gmail.com] >> *Sent:* Thursday, November 19, 2015 8:09 PM >> *To:* Jun Zhuang >> *Cc:* hotspot-gc-use at openjdk.java.net >> *Subject:* Re: Seeking answer to a GC pattern >> >> >> >> Use -XX:MaxTenuringThreshold=2 and you might see better behavior that >> +AlwaysTenure (which is almost always a very bad choice). That will at >> least reduce some of the nepotism issues from +AlwaysTenure that Thomas >> mentions. MTT > 2 is unlikely to help at your current frequency of minor >> collections since the mortality after age 1 is fairly low (from your >> tenuring distribution). Worth a quick test. 
>> >> >> >> -- ramki >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jun.zhuang at hobsons.com Tue Nov 24 21:21:45 2015 From: jun.zhuang at hobsons.com (Jun Zhuang) Date: Tue, 24 Nov 2015 21:21:45 +0000 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: Hi Srinivas, Appreciate your input. Following are answers to your questions. I?ll try your other advices. - How many CPU's (and GC threads) do you have? Does the ratio of "real" to "usr+sys" increase as "real" ramps up? 4 CPUs. Here is the time for one of the young GCs with +AlwaysTenure: Times: user=1.45 sys=0.00, real=0.40 secs. The sys time is always close to 0, user time is more than 3x real time and increases with real time accordingly. - Does the amount that is promoted stay constant? With +AlwaysTenure, looks like the promoted amount was fairly constant @ about 125K. Following shows the first 3 young GCs right after a full collection and last 3 right before the next one. 328706.505: [GC [PSYoungGen: 129018K->0K(130048K)] 286309K->160838K(2096896K), 0.0152110 secs] [Times: user=0.04 sys=0.01, real=0.01 secs] 328711.687: [GC [PSYoungGen: 129024K->0K(130048K)] 289862K->165092K(2096896K), 0.0199390 secs] [Times: user=0.06 sys=0.00, real=0.02 secs] 328716.875: [GC [PSYoungGen: 129024K->0K(130048K)] 294116K->168626K(2096896K), 0.0247520 secs] [Times: user=0.07 sys=0.00, real=0.02 secs] ? 331103.140: [GC [PSYoungGen: 129024K->0K(130048K)] 2082788K->1957116K(2096896K), 0.2220360 secs] [Times: user=0.78 sys=0.00, real=0.23 secs] 331108.118: [GC [PSYoungGen: 129024K->0K(130048K)] 2086140K->1960268K(2096896K), 0.2170640 secs] [Times: user=0.79 sys=0.01, real=0.22 secs] 331113.074: [GC [PSYoungGen: 129024K->0K(130048K)] 2089292K->1963948K(2096896K), 0.2132430 secs] [Times: user=0.79 sys=0.00, real=0.21 secs] - Does the sawtooth of minor gc times happen even with MTT=1 or AlwaysTenure? Yes. It always happens with or without AlwaysTenure. - (Hint: How many young collections do you see between the major collections when you see the sawtooth in young collection times? How does it compare with the highest age of object that is kept in the survivor spaces?) For one of my tests with 128m young gen and +AlwaysTenure, the average # of young GCs before a full collection was a little over 500. Sincerely, Jun From: Srinivas Ramakrishna [mailto:ysr1729 at gmail.com] Sent: Tuesday, November 24, 2015 3:14 PM To: Grzegorz Molenda Cc: Jun Zhuang ; hotspot-gc-use at openjdk.java.net Subject: Re: Seeking answer to a GC pattern What Grzegorz & Thomas said. Also you might take a heap dump before and after a full gc (-XX:+PrintClassHistogramBefore/AfterFullGC) to see the types that are reclaimed in the old gen. Might give you an idea as to the types of objects that got promoted and then later died, and hence whether avoidable nepotism is or is not a factor (and thence what you might null out to reduce such nepotism). Also, I guess what I meant was MTT=1. However, given that going from MTT=10 to MTT=2 didn't make any appreciable difference, MTT=2 to MTT=1 will not either. 
You might also consider increasing yr young gen size, but that will likely also increase your pause times since objects tend to either die quickly or survive for a long time, and increasing the young gen size will still not age objects sufficiently to cause an increase in mortality. How many CPU's (and GC threads) do you have? Does the ratio of "real" to "usr+sys" increase as "real" ramps up? Does the amount that is promoted stay constant? That might imply that something is interfering with parallelization of copying. Typically that means that there is a long skinny data structure, such as a singly linked list that is being copied, although why that list would become longer (in terms of longer times) is not clear. Does the sawtooth of minor gc times happen even with MTT=1 or AlwaysTenure? (Hint: How many young collections do you see between the major collections when you see the sawtooth in young collection times? How does it compare with the highest age of object that is kept in the survivor spaces?) -- ramki On Tue, Nov 24, 2015 at 9:17 AM, Grzegorz Molenda > wrote: Just a few tips: Check OS stats for paging / swapping activity at both VM'and hypervisor levels. Make sure the OS doesn't use transparent huge pages. If the above two don't help, try enabling -XX:+PrintGCTaskTimeStamps to diagnose, which part of GC collecion takes the most time. Note values aren't reported in time units, but in ticks. Subtract one from the other reported per task . Compare between tasks per signle collection and check stats from a few collections in row, to get the idea, where it does degradate. Thanks, Grzegorz 2015-11-20 20:46 GMT+01:00 Jun Zhuang >: Hi Srinivas, Thanks for your suggestion. I ran test with following parameters: -server -XX:+UseCompressedOops -XX:+TieredCompilation -XX:ReservedCodeCacheSize=64m -XX:+UseCodeCacheFlushing -XX:+PrintTenuringDistribution -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=6 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=2 But the -XX:MaxTenuringThreshold=2 setting does not seem to help anything. I am still seeing similar GC pattern as with the +AlwaysTenure, actually the young GC time is higher with MTT=2 (getting to 0.5 secs vs. 0.25 with AlwaysTenure). Unless anyone else can provide another theory, I am convinced that nepotism is at play here. Changing the java startup parameters can only get me so far, dev will have to look at the code and see what can be done on the code level. Thanks, Jun From: Srinivas Ramakrishna [mailto:ysr1729 at gmail.com] Sent: Thursday, November 19, 2015 8:09 PM To: Jun Zhuang > Cc: hotspot-gc-use at openjdk.java.net Subject: Re: Seeking answer to a GC pattern Use -XX:MaxTenuringThreshold=2 and you might see better behavior that +AlwaysTenure (which is almost always a very bad choice). That will at least reduce some of the nepotism issues from +AlwaysTenure that Thomas mentions. MTT > 2 is unlikely to help at your current frequency of minor collections since the mortality after age 1 is fairly low (from your tenuring distribution). Worth a quick test. -- ramki _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: From Peter.B.Kessler at Oracle.COM Wed Nov 25 18:58:07 2015 From: Peter.B.Kessler at Oracle.COM (Peter B. 
Kessler) Date: Wed, 25 Nov 2015 10:58:07 -0800 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: <565604BF.9020303@Oracle.COM> On 11/24/15 01:21 PM, Jun Zhuang wrote: > Hi Srinivas, > > Appreciate your input. Following are answers to your questions. I?ll try your other advices. > > -How many CPU's (and GC threads) do you have? Does the ratio of "real" to "usr+sys" increase as "real" ramps up? > > 4 CPUs. > > Here is the time for one of the young GCs with +AlwaysTenure: Times: user=1.45 sys=0.00, real=0.40 secs. The sys time is always close to 0, user time is more than 3x real time and increases with real time accordingly. > > -Does the amount that is promoted stay constant? > > With +AlwaysTenure, looks like the promoted amount was fairly constant @ about 125K. Following shows the first 3 young GCs right after a full collection and last 3 right before the next one. > > 328706.505: [GC [PSYoungGen: 129018K->0K(130048K)] 286309K->160838K(2096896K), 0.0152110 secs] [Times: user=0.04 sys=0.01, real=0.01 secs] > > 328711.687: [GC [PSYoungGen: 129024K->0K(130048K)] 289862K->165092K(2096896K), 0.0199390 secs] [Times: user=0.06 sys=0.00, real=0.02 secs] > > 328716.875: [GC [PSYoungGen: 129024K->0K(130048K)] 294116K->168626K(2096896K), 0.0247520 secs] [Times: user=0.07 sys=0.00, real=0.02 secs] > > ? > > 331103.140: [GC [PSYoungGen: 129024K->0K(130048K)] 2082788K->1957116K(2096896K), 0.2220360 secs] [Times: user=0.78 sys=0.00, real=0.23 secs] > > 331108.118: [GC [PSYoungGen: 129024K->0K(130048K)] 2086140K->1960268K(2096896K), 0.2170640 secs] [Times: user=0.79 sys=0.01, real=0.22 secs] > > 331113.074: [GC [PSYoungGen: 129024K->0K(130048K)] 2089292K->1963948K(2096896K), 0.2132430 secs] [Times: user=0.79 sys=0.00, real=0.21 secs] > > -Does the sawtooth of minor gc times happen even with MTT=1 or AlwaysTenure? > > Yes. It always happens with or without AlwaysTenure. > > -(Hint: How many young collections do you see between the major collections when you see the sawtooth in young collection times? How does it compare with the highest age of object that is kept in the survivor spaces?) > > For one of my tests with 128m young gen and +AlwaysTenure, the average # of young GCs before a full collection was a little over 500. > > Sincerely, > > Jun Looking at the increase in your heap size after each young generation collection seems to show that you are promoting 3MB~4MB at each young generation collection. E.g., 165092K - 160838K = 4254K. With your 1920MB old generation that would let you have 500 young generation collections between full collections, as you say. If you were promoting only 125KB at each young generation collection your 1920MB old generation could absorb promotions from 15000 young generation collections. What is confusing is that times for the young generation collections increases proportionally with the size of the old generation. Your sawtooth pattern. Usually the time for a young generation collection is proportional to the amount of space that is promoted, which seems to be constant in your case. That implies some cost proportional to the size of the old generation: but what? It does not seem to take you longer to allocate through the space in your young generation when the old generation is empty than when it is full (~5 seconds) so it does not seem like you are doing more work when the old generation is full: e.g., dirtying cards for data that has piled up in the old generation, which would cause more work for the collector. 
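For reference, the same arithmetic can be applied to every young collection in the log with a small throwaway parser; this is only a sketch that assumes the -XX:+PrintGC line format shown above and takes the log file name on the command line (run as "java PromotionPerCollection gc.log"). For the second sample line above it reports about 4254K promoted, matching the estimate from consecutive heap occupancies.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PromotionPerCollection {
    // Matches e.g. "[PSYoungGen: 129024K->0K(130048K)] 289862K->165092K(2096896K)"
    private static final Pattern GC_LINE = Pattern.compile(
            "\\[PSYoungGen: (\\d+)K->(\\d+)K\\(\\d+K\\)\\] (\\d+)K->(\\d+)K\\(\\d+K\\)");

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = GC_LINE.matcher(line);
                if (!m.find()) {
                    continue;
                }
                long youngFreed = Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
                long heapFreed  = Long.parseLong(m.group(3)) - Long.parseLong(m.group(4));
                // Whatever left the young generation but did not leave the heap
                // must have been promoted to the old generation.
                System.out.println("promoted ~" + (youngFreed - heapFreed) + "K");
            }
        }
    }
}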
With a 10x increase in time like that, one would think it would be easy to identify with a profiler, or detailed timers for phases, if there are any in the code. To Ramki: The parallelization seems to hold at somewhat over 3. ... peter > *From:*Srinivas Ramakrishna [mailto:ysr1729 at gmail.com] > *Sent:* Tuesday, November 24, 2015 3:14 PM > *To:* Grzegorz Molenda > *Cc:* Jun Zhuang ; hotspot-gc-use at openjdk.java.net > *Subject:* Re: Seeking answer to a GC pattern > > What Grzegorz & Thomas said. > > Also you might take a heap dump before and after a full gc (-XX:+PrintClassHistogramBefore/AfterFullGC) to see the types that are reclaimed in the old gen. Might give you an idea as to the types of objects that got promoted and then later died, and hence whether avoidable nepotism is or is not a factor (and thence what you might null out to reduce such nepotism). > > Also, I guess what I meant was MTT=1. However, given that going from MTT=10 to MTT=2 didn't make any appreciable difference, MTT=2 to MTT=1 will not either. > > You might also consider increasing yr young gen size, but that will likely also increase your pause times since objects tend to either die quickly or survive for a long time, and increasing the young gen size will still not age objects sufficiently to cause an increase in mortality. > > How many CPU's (and GC threads) do you have? Does the ratio of "real" to "usr+sys" increase as "real" ramps up? Does the amount that is promoted stay constant? That might imply that something is interfering with parallelization of copying. Typically that means that there is a long skinny data structure, such as a singly linked list that is being copied, although why that list would become longer (in terms of longer times) is not clear. Does the sawtooth of minor gc times happen even with MTT=1 or AlwaysTenure? (Hint: How many young collections do you see between the major collections when you see the sawtooth in young collection times? How does it compare with the highest age of object that is kept in the survivor spaces?) > > -- ramki > > On Tue, Nov 24, 2015 at 9:17 AM, Grzegorz Molenda > wrote: > > Just a few tips: > > Check OS stats for paging / swapping activity at both VM'and hypervisor levels. > > Make sure the OS doesn't use transparent huge pages. > > If the above two don't help, try enabling -XX:+PrintGCTaskTimeStamps to diagnose, which part of GC collecion takes the most time. Note values aren't reported in time units, but in ticks. Subtract one from the other reported per task . Compare between tasks per signle collection and check stats from a few collections in row, to get the idea, where it does degradate. > > > Thanks, > > > Grzegorz > > 2015-11-20 20:46 GMT+01:00 Jun Zhuang >: > > Hi Srinivas, > > Thanks for your suggestion. I ran test with following parameters: > > -server -XX:+UseCompressedOops -XX:+TieredCompilation -XX:ReservedCodeCacheSize=64m -XX:+UseCodeCacheFlushing -XX:+PrintTenuringDistribution -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=6 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=2 > > But the -XX:MaxTenuringThreshold=2 setting does not seem to help anything. I am still seeing similar GC pattern as with the +AlwaysTenure, actually the young GC time is higher with MTT=2 (getting to 0.5 secs vs. 0.25 with AlwaysTenure). > > Unless anyone else can provide another theory, I am convinced that nepotism is at play here. 
Changing the java startup parameters can only get me so far, dev will have to look at the code and see what can be done on the code level. > > Thanks, > > Jun > > *From:*Srinivas Ramakrishna [mailto:ysr1729 at gmail.com ] > *Sent:* Thursday, November 19, 2015 8:09 PM > *To:* Jun Zhuang > > *Cc:* hotspot-gc-use at openjdk.java.net > *Subject:* Re: Seeking answer to a GC pattern > > Use -XX:MaxTenuringThreshold=2 and you might see better behavior that +AlwaysTenure (which is almost always a very bad choice). That will at least reduce some of the nepotism issues from +AlwaysTenure that Thomas mentions. MTT > 2 is unlikely to help at your current frequency of minor collections since the mortality after age 1 is fairly low (from your tenuring distribution). Worth a quick test. > > -- ramki > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > From ysr1729 at gmail.com Wed Nov 25 20:54:05 2015 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 25 Nov 2015 12:54:05 -0800 Subject: Seeking answer to a GC pattern In-Reply-To: <565604BF.9020303@Oracle.COM> References: <565604BF.9020303@Oracle.COM> Message-ID: Yes, sounds like some card scanning or BOT walking pathology perhaps? Are you on the latest 8u66 or better for these numbers? Did you also try 9? If you have a test case, it might make sense to file a ticket, so someone might take a closer look. -- ramki On Wed, Nov 25, 2015 at 10:58 AM, Peter B. Kessler < Peter.B.Kessler at oracle.com> wrote: > On 11/24/15 01:21 PM, Jun Zhuang wrote: > >> Hi Srinivas, >> >> Appreciate your input. Following are answers to your questions. I?ll try >> your other advices. >> >> -How many CPU's (and GC threads) do you have? Does the ratio of "real" to >> "usr+sys" increase as "real" ramps up? >> >> 4 CPUs. >> >> Here is the time for one of the young GCs with +AlwaysTenure: Times: >> user=1.45 sys=0.00, real=0.40 secs. The sys time is always close to 0, user >> time is more than 3x real time and increases with real time accordingly. >> >> -Does the amount that is promoted stay constant? >> >> With +AlwaysTenure, looks like the promoted amount was fairly constant @ >> about 125K. Following shows the first 3 young GCs right after a full >> collection and last 3 right before the next one. >> >> 328706.505: [GC [PSYoungGen: 129018K->0K(130048K)] >> 286309K->160838K(2096896K), 0.0152110 secs] [Times: user=0.04 sys=0.01, >> real=0.01 secs] >> >> 328711.687: [GC [PSYoungGen: 129024K->0K(130048K)] >> 289862K->165092K(2096896K), 0.0199390 secs] [Times: user=0.06 sys=0.00, >> real=0.02 secs] >> >> 328716.875: [GC [PSYoungGen: 129024K->0K(130048K)] >> 294116K->168626K(2096896K), 0.0247520 secs] [Times: user=0.07 sys=0.00, >> real=0.02 secs] >> >> ? 
>> >> 331103.140: [GC [PSYoungGen: 129024K->0K(130048K)] >> 2082788K->1957116K(2096896K), 0.2220360 secs] [Times: user=0.78 sys=0.00, >> real=0.23 secs] >> >> 331108.118: [GC [PSYoungGen: 129024K->0K(130048K)] >> 2086140K->1960268K(2096896K), 0.2170640 secs] [Times: user=0.79 sys=0.01, >> real=0.22 secs] >> >> 331113.074: [GC [PSYoungGen: 129024K->0K(130048K)] >> 2089292K->1963948K(2096896K), 0.2132430 secs] [Times: user=0.79 sys=0.00, >> real=0.21 secs] >> >> -Does the sawtooth of minor gc times happen even with MTT=1 or >> AlwaysTenure? >> >> Yes. It always happens with or without AlwaysTenure. >> >> -(Hint: How many young collections do you see between the major >> collections when you see the sawtooth in young collection times? How does >> it compare with the highest age of object that is kept in the survivor >> spaces?) >> >> For one of my tests with 128m young gen and +AlwaysTenure, the average # >> of young GCs before a full collection was a little over 500. >> >> Sincerely, >> >> Jun >> > > Looking at the increase in your heap size after each young generation > collection seems to show that you are promoting 3MB~4MB at each young > generation collection. E.g., 165092K - 160838K = 4254K. With your 1920MB > old generation that would let you have 500 young generation collections > between full collections, as you say. If you were promoting only 125KB at > each young generation collection your 1920MB old generation could absorb > promotions from 15000 young generation collections. > > What is confusing is that times for the young generation collections > increases proportionally with the size of the old generation. Your > sawtooth pattern. Usually the time for a young generation collection is > proportional to the amount of space that is promoted, which seems to be > constant in your case. That implies some cost proportional to the size of > the old generation: but what? It does not seem to take you longer to > allocate through the space in your young generation when the old generation > is empty than when it is full (~5 seconds) so it does not seem like you are > doing more work when the old generation is full: e.g., dirtying cards for > data that has piled up in the old generation, which would cause more work > for the collector. > > With a 10x increase in time like that, one would think it would be easy to > identify with a profiler, or detailed timers for phases, if there are any > in the code. > > To Ramki: The parallelization seems to hold at somewhat over 3. > > ... peter > > *From:*Srinivas Ramakrishna [mailto:ysr1729 at gmail.com] >> *Sent:* Tuesday, November 24, 2015 3:14 PM >> *To:* Grzegorz Molenda >> *Cc:* Jun Zhuang ; >> hotspot-gc-use at openjdk.java.net >> *Subject:* Re: Seeking answer to a GC pattern >> >> What Grzegorz & Thomas said. >> >> Also you might take a heap dump before and after a full gc >> (-XX:+PrintClassHistogramBefore/AfterFullGC) to see the types that are >> reclaimed in the old gen. Might give you an idea as to the types of objects >> that got promoted and then later died, and hence whether avoidable nepotism >> is or is not a factor (and thence what you might null out to reduce such >> nepotism). >> >> Also, I guess what I meant was MTT=1. However, given that going from >> MTT=10 to MTT=2 didn't make any appreciable difference, MTT=2 to MTT=1 will >> not either. 
>> >> You might also consider increasing yr young gen size, but that will >> likely also increase your pause times since objects tend to either die >> quickly or survive for a long time, and increasing the young gen size will >> still not age objects sufficiently to cause an increase in mortality. >> >> How many CPU's (and GC threads) do you have? Does the ratio of "real" to >> "usr+sys" increase as "real" ramps up? Does the amount that is promoted >> stay constant? That might imply that something is interfering with >> parallelization of copying. Typically that means that there is a long >> skinny data structure, such as a singly linked list that is being copied, >> although why that list would become longer (in terms of longer times) is >> not clear. Does the sawtooth of minor gc times happen even with MTT=1 or >> AlwaysTenure? (Hint: How many young collections do you see between the >> major collections when you see the sawtooth in young collection times? How >> does it compare with the highest age of object that is kept in the survivor >> spaces?) >> >> -- ramki >> >> On Tue, Nov 24, 2015 at 9:17 AM, Grzegorz Molenda > > wrote: >> >> Just a few tips: >> >> Check OS stats for paging / swapping activity at both VM'and >> hypervisor levels. >> >> Make sure the OS doesn't use transparent huge pages. >> >> If the above two don't help, try enabling -XX:+PrintGCTaskTimeStamps >> to diagnose, which part of GC collecion takes the most time. Note values >> aren't reported in time units, but in ticks. Subtract one from the other >> reported per task . Compare between tasks per signle collection and check >> stats from a few collections in row, to get the idea, where it does >> degradate. >> >> >> Thanks, >> >> >> Grzegorz >> >> 2015-11-20 20:46 GMT+01:00 Jun Zhuang > >: >> >> Hi Srinivas, >> >> Thanks for your suggestion. I ran test with following parameters: >> >> -server -XX:+UseCompressedOops -XX:+TieredCompilation >> -XX:ReservedCodeCacheSize=64m -XX:+UseCodeCacheFlushing >> -XX:+PrintTenuringDistribution -Xms2g -Xmx2g -XX:MaxPermSize=256m >> -XX:NewSize=128m -XX:MaxNewSize=128m -XX:SurvivorRatio=6 >> -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=2 >> >> But the -XX:MaxTenuringThreshold=2 setting does not seem to help >> anything. I am still seeing similar GC pattern as with the +AlwaysTenure, >> actually the young GC time is higher with MTT=2 (getting to 0.5 secs vs. >> 0.25 with AlwaysTenure). >> >> Unless anyone else can provide another theory, I am convinced >> that nepotism is at play here. Changing the java startup parameters can >> only get me so far, dev will have to look at the code and see what can be >> done on the code level. >> >> Thanks, >> >> Jun >> >> *From:*Srinivas Ramakrishna [mailto:ysr1729 at gmail.com > ysr1729 at gmail.com>] >> *Sent:* Thursday, November 19, 2015 8:09 PM >> *To:* Jun Zhuang > jun.zhuang at hobsons.com>> >> *Cc:* hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> *Subject:* Re: Seeking answer to a GC pattern >> >> Use -XX:MaxTenuringThreshold=2 and you might see better behavior >> that +AlwaysTenure (which is almost always a very bad choice). That will at >> least reduce some of the nepotism issues from +AlwaysTenure that Thomas >> mentions. MTT > 2 is unlikely to help at your current frequency of minor >> collections since the mortality after age 1 is fairly low (from your >> tenuring distribution). Worth a quick test. 
>> >> -- ramki >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From ysr1729 at gmail.com Fri Nov 20 01:08:50 2015 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Fri, 20 Nov 2015 01:08:50 -0000 Subject: Seeking answer to a GC pattern In-Reply-To: References: Message-ID: Use -XX:MaxTenuringThreshold=2 and you might see better behavior that +AlwaysTenure (which is almost always a very bad choice). That will at least reduce some of the nepotism issues from +AlwaysTenure that Thomas mentions. MTT > 2 is unlikely to help at your current frequency of minor collections since the mortality after age 1 is fairly low (from your tenuring distribution). Worth a quick test. -- ramki On Mon, Oct 26, 2015 at 12:33 PM, Jun Zhuang wrote: > Hi, > > > > When running performance testing for a java web service running on JBOSS, > I observed a clear saw-tooth pattern in CPU utilization that closely > follows the GC cycles. see below: > > > > > > Java startup parameters used: > > > > -XX:+TieredCompilation -XX:+PrintTenuringDistribution -Xms2048m -Xmx4096m > -XX:MaxPermSize=256m -XX:NewSize=128m -XX:MaxNewSize=128m > -XX:SurvivorRatio=126 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC > -XX:+AlwaysTenure > > > > With this set of parameters, the young GC pause time ranged from 0.02 to > 0.25 secs. When I used 256m for the young gen, the young GC pause time > ranged from 0.02 to 0.5 secs. My understanding is that the young GC pause > time normally stays fairly stable, I have spent quite some time researching > but have yet to find an answer to this behavior. I wonder if people in this > distribution list can help me out? > > > > *Other related info* > > > > * Server Specs: VM with 4 CPUs and 8 Gb mem > > * Test setup: > > ? # of Vusers: 100 > > ? Ramp up: 10 mins > > ? Pacing: 5 ? 7 secs > > * I tried with all other available GC algorithms, tenuring thresholds, > various sizes of the generations, but the AlwaysTenure parameter seems to > work the best so far. > > > > > > Any input will be highly appreciated. > > > > Sincerely yours, > > > > *Jun Zhuang* > > *Sr. Performance QA Engineer | Hobsons* > > *513-746-2288 <513-746-2288> (work)* > > *513-227-7643 <513-227-7643> (mobile)* > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image010.jpg Type: image/jpeg Size: 11773 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image011.jpg Type: image/jpeg Size: 33498 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image012.jpg Type: image/jpeg Size: 19704 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image002.jpg Type: image/jpeg Size: 7083 bytes Desc: not available URL:
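As a closing illustration of the tenuring-distribution argument above: the per-age mortality Srinivas refers to can be estimated from the -XX:+PrintTenuringDistribution output quoted earlier in this thread by comparing the bytes recorded at age N in one collection with the bytes at age N+1 in the next. A rough throwaway sketch, assuming one "- age N: ... bytes" entry per line as in the raw log and taking the log file name on the command line (run as "java TenuringSurvival gc.log"):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TenuringSurvival {
    private static final Pattern AGE = Pattern.compile("- age\\s+(\\d+):\\s+(\\d+) bytes");
    private static final Pattern NEW_GC = Pattern.compile("Desired survivor size");

    public static void main(String[] args) throws Exception {
        // One map per young collection: age -> bytes surviving at that age.
        List<Map<Integer, Long>> gcs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            Map<Integer, Long> current = null;
            while ((line = in.readLine()) != null) {
                if (NEW_GC.matcher(line).find()) {
                    current = new HashMap<>();
                    gcs.add(current);
                } else if (current != null) {
                    Matcher m = AGE.matcher(line);
                    if (m.find()) {
                        current.put(Integer.parseInt(m.group(1)), Long.parseLong(m.group(2)));
                    }
                }
            }
        }
        // Bytes at age N in collection i that are still present at age N+1 in
        // collection i+1 give a crude survival (1 - mortality) estimate per age.
        for (int i = 0; i + 1 < gcs.size(); i++) {
            for (Map.Entry<Integer, Long> e : gcs.get(i).entrySet()) {
                Long next = gcs.get(i + 1).get(e.getKey() + 1);
                if (next != null && e.getValue() > 0) {
                    System.out.printf("GC %d: age %d -> %d survival ~%.0f%%%n",
                            i, e.getKey(), e.getKey() + 1, 100.0 * next / e.getValue());
                }
            }
        }
    }
}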