From renxijuninfo at gmail.com Wed Jan 6 19:37:08 2010 From: renxijuninfo at gmail.com (renxijuninfo) Date: Thu, 7 Jan 2010 11:37:08 +0800 Subject: why after about three days , Hotspot frequent running full GC Message-ID: <201001071137050229531@gmail.com> why after about three days , Hotspot frequent running full GC, although there are enough old space? At first the GC is running good about 20 seconds run a young GC and about 15 minute run a full GC. But after about three days the full GC running about every 5 second. I use jstat monitor when running full GC the Old Space just used less then 60%. This is jvm bug or others? Thanks. FYI: # java -version java version "1.6.0_11" Java(TM) SE Runtime Environment (build 1.6.0_11-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) # uname -a Linux 2.6.18-128.7.1.el5 #1 SMP Wed Aug 19 04:00:49 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux jps -lmv: 26938 com.caucho.server.resin.Resin -socketwait 58977 -server b -stdout /usr/local/resin/log/stdout.log -stderr /usr/local/resin/log/stderr.log -Xmx6144m -Xms6144m -Xss512k -XX:PermSize=512M -XX:MaxPermSize=512m -XX:NewSize=3072m -XX:MaxNewSize=3072m -XX:SurvivorRatio=14 -XX:MaxTenuringThreshold=15 -XX:GCTimeRatio=19 -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+CMSScavengeBeforeRemark -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Xloggc:log/gc.log -Dcom.sun.management.jmxremote -Xss1m -Dresin.home=/usr/local/resin -Dserver.root=/usr/local/resin -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl 2010-01-07 renxijuninfo -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100107/9635d20e/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 157137 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100107/9635d20e/attachment-0001.jpe From Y.S.Ramakrishna at Sun.COM Thu Jan 7 09:15:38 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Thu, 07 Jan 2010 09:15:38 -0800 Subject: why after about three days , Hotspot frequent running full GC In-Reply-To: <201001071137050229531@gmail.com> References: <201001071137050229531@gmail.com> Message-ID: <4B4616BA.8000401@Sun.COM> Difficult to tell. Did you check the perm gen occupancy? GC logs with +PrintHeapAtGC might provide further clues as well. What do your current gc logs say about perm gen occupancy (i believe that would be printed from a full gc?) -- ramki On 01/06/10 19:37, renxijuninfo wrote: > why after about three days , Hotspot frequent running full GC, although > there are enough old space? > > At first the GC is running good about 20 seconds run a young GC and > about 15 minute run a full GC. But after about three days the full GC > running about every 5 second. > I use jstat monitor when running full GC the Old Space just used less > then 60%. > > This is jvm bug or others? > > Thanks. 
> > > FYI: > > # java -version > java version "1.6.0_11" > Java(TM) SE Runtime Environment (build 1.6.0_11-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) > > # uname -a > Linux 2.6.18-128.7.1.el5 #1 SMP Wed Aug 19 04:00:49 EDT 2009 x86_64 > x86_64 x86_64 GNU/Linux > > jps -lmv: > > 26938 com.caucho.server.resin.Resin -socketwait 58977 -server b -stdout > /usr/local/resin/log/stdout.log -stderr /usr/local/resin/log/stderr.log > -Xmx6144m > -Xms6144m > -Xss512k > -XX:PermSize=512M > -XX:MaxPermSize=512m > -XX:NewSize=3072m > -XX:MaxNewSize=3072m > -XX:SurvivorRatio=14 > -XX:MaxTenuringThreshold=15 > -XX:GCTimeRatio=19 > -XX:+DisableExplicitGC > -XX:+UseParNewGC > -XX:+CMSScavengeBeforeRemark > -XX:+UseConcMarkSweepGC > -XX:+UseCMSCompactAtFullCollection > -XX:+CMSClassUnloadingEnabled > -XX:CMSInitiatingOccupancyFraction=80 > -XX:SoftRefLRUPolicyMSPerMB=0 > -XX:+PrintClassHistogram > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintTenuringDistribution > -Xloggc:log/gc.log > -Dcom.sun.management.jmxremote -Xss1m -Dresin.home=/usr/local/resin > -Dserver.root=/usr/local/resin > -Djava.util.logging.manager=com.caucho.log.LogManagerImpl > -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl > > > > > > > 2010-01-07 > ------------------------------------------------------------------------ > renxijuninfo > > > ------------------------------------------------------------------------ > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From shaun.hennessy at alcatel-lucent.com Mon Jan 11 07:07:55 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 10:07:55 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4AC1493A.2030004@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> Message-ID: <4B4B3ECB.5090105@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/c751ca78/attachment.html From shaun.hennessy at alcatel-lucent.com Mon Jan 11 07:31:17 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 10:31:17 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B3ECB.5090105@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> Message-ID: <4B4B4445.5020209@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/495b3069/attachment.html From nikolay.diakov at fredhopper.com Mon Jan 11 08:35:40 2010 From: nikolay.diakov at fredhopper.com (Nikolay Diakov) Date: Mon, 11 Jan 2010 17:35:40 +0100 Subject: unusually long GC latencies Message-ID: <4B4B535C.50704@fredhopper.com> Dear all, We have a server application, which under some load seems to do some unusually long garbage collections that our clients experience as a service denied for 2+ minutes around each moment of the long garbage collection. 
Upon further look we see in the GC log that these happen during young generation cleanups: GC log lines examplifying the issue ----------------------------------- server1: 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 sys=0.00, real=0.03 secs] 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 sys=0.12, real=183.03 secs] 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 sys=0.01, real=0.03 secs] server2: 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: user=0.37 sys=0.00, real=0.03 secs] 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] [Times: user=0.38 sys=0.16, real=185.38 secs] 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: user=0.48 sys=0.03, real=0.05 secs] If interested, I have the full logs - the issue appears 4-5 times in a day. We have tried tuning the GC options of our server application, so far without success - we still get the same long collections. We suspect this happens because of the 16 core machine, because when we perform the same test on a 8 core machine, we do not observe the long garbage collections at all. We also tried both the CMS and the ParallelGC algorithms - we get the delays on the 16 core machine in both cases, and in both cases the server works OK on the 8 core machine. * Would you have a look what goes on? Below I post the information about the OS, Java version, application and HW. Attached find the GC logs from two servers - one running the CMS and the other running ParallelGC. OS -- Ubuntu 9.04.3 LTS uname -a Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 GNU/Linux HW -- POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ 12GB MEMORY (6X2GB DUAL RANK) 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 PERC 6/I RAID CONTROLLER CARD 256MB PCIE Java ---- java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) VM settings server1 ------------------- -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails VM settings server2 ------------------- -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails We have also observed the issue without all GC-related settings, thus the VM running with defaults - same result. Application: ----------- Our application runs in a JBoss container and processes http requests. We use up to 50 processing threads. We perform about 10-20 request processings per second during the executed test that exposed the issue. Our server produces quite substantial amount of garbage. Yours sincerely, Nikolay Diakov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/9e49acf3/attachment.html From nikolay.diakov at fredhopper.com Mon Jan 11 08:44:51 2010 From: nikolay.diakov at fredhopper.com (Nikolay Diakov) Date: Mon, 11 Jan 2010 17:44:51 +0100 Subject: unusually long GC latencies In-Reply-To: <4B4B535C.50704@fredhopper.com> References: <4B4B535C.50704@fredhopper.com> Message-ID: <4B4B5583.8050708@fredhopper.com> Linux version is actually Ubuntu 8.04.03 LTS On 11-01-10 17:35, Nikolay Diakov wrote: > Dear all, > > We have a server application, which under some load seems to do some > unusually long garbage collections that our clients experience as a > service denied for 2+ minutes around each moment of the long garbage > collection. Upon further look we see in the GC log that these happen > during young generation cleanups: > > GC log lines examplifying the issue > ----------------------------------- > server1: > 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] > 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 > sys=0.00, real=0.03 secs] > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] > 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 > sys=0.01, real=0.03 secs] > > server2: > 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), > 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: > user=0.37 sys=0.00, real=0.03 secs] > 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), > 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] > [Times: user=0.38 sys=0.16, real=185.38 secs] > 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), > 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: > user=0.48 sys=0.03, real=0.05 secs] > > If interested, I have the full logs - the issue appears 4-5 times in a > day. > > We have tried tuning the GC options of our server application, so far > without success - we still get the same long collections. We suspect > this happens because of the 16 core machine, because when we perform > the same test on a 8 core machine, we do not observe the long garbage > collections at all. > > We also tried both the CMS and the ParallelGC algorithms - we get the > delays on the 16 core machine in both cases, and in both cases the > server works OK on the 8 core machine. > > * Would you have a look what goes on? Below I post the information > about the OS, Java version, application and HW. Attached find the GC > logs from two servers - one running the CMS and the other running > ParallelGC. 
> > OS > -- > Ubuntu 9.04.3 LTS > > uname -a > Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 > GNU/Linux > > HW > -- > POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ > 12GB MEMORY (6X2GB DUAL RANK) > 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 > PERC 6/I RAID CONTROLLER CARD 256MB PCIE > > Java > ---- > java version "1.6.0_16" > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > > VM settings server1 > ------------------- > -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC > -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m > -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > VM settings server2 > ------------------- > -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m > -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize > =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > We have also observed the issue without all GC-related settings, thus > the VM running with defaults - same result. > > Application: > ----------- > Our application runs in a JBoss container and processes http requests. > We use up to 50 processing threads. We perform about 10-20 request > processings per second during the executed test that exposed the > issue. Our server produces quite substantial amount of garbage. > > Yours sincerely, > Nikolay Diakov > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/1cd0b5e7/attachment-0001.html From Jon.Masamitsu at Sun.COM Mon Jan 11 09:00:42 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 09:00:42 -0800 Subject: unusually long GC latencies In-Reply-To: <4B4B535C.50704@fredhopper.com> References: <4B4B535C.50704@fredhopper.com> Message-ID: <4B4B593A.1030705@sun.com> Nikolay, The line 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 sys=0.12, real=183.03 secs] says that the user time is 0.21 (which is in line with the collections before and after), the system time in 0.12 which is a jump up from the before and after collections). And that the real time is 183.02 which is the problem. I would guess that the process is waiting for something. Might it be swapping? Jon Nikolay Diakov wrote On 01/11/10 08:35,: > Dear all, > > We have a server application, which under some load seems to do some > unusually long garbage collections that our clients experience as a > service denied for 2+ minutes around each moment of the long garbage > collection. 
Upon further look we see in the GC log that these happen > during young generation cleanups: > > GC log lines examplifying the issue > ----------------------------------- > server1: > 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] > 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 > sys=0.00, real=0.03 secs] > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] > 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 > sys=0.01, real=0.03 secs] > > server2: > 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), > 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: > user=0.37 sys=0.00, real=0.03 secs] > 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), > 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] > [Times: user=0.38 sys=0.16, real=185.38 secs] > 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), > 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: > user=0.48 sys=0.03, real=0.05 secs] > > If interested, I have the full logs - the issue appears 4-5 times in a > day. > > We have tried tuning the GC options of our server application, so far > without success - we still get the same long collections. We suspect > this happens because of the 16 core machine, because when we perform > the same test on a 8 core machine, we do not observe the long garbage > collections at all. > > We also tried both the CMS and the ParallelGC algorithms - we get the > delays on the 16 core machine in both cases, and in both cases the > server works OK on the 8 core machine. > > * Would you have a look what goes on? Below I post the information > about the OS, Java version, application and HW. Attached find the GC > logs from two servers - one running the CMS and the other running > ParallelGC. > > OS > -- > Ubuntu 9.04.3 LTS > > uname -a > Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 > GNU/Linux > > HW > -- > POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ > 12GB MEMORY (6X2GB DUAL RANK) > 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 > PERC 6/I RAID CONTROLLER CARD 256MB PCIE > > Java > ---- > java version "1.6.0_16" > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > > VM settings server1 > ------------------- > -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC > -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m > -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > VM settings server2 > ------------------- > -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m > -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize > =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > We have also observed the issue without all GC-related settings, thus > the VM running with defaults - same result. > > Application: > ----------- > Our application runs in a JBoss container and processes http requests. > We use up to 50 processing threads. We perform about 10-20 request > processings per second during the executed test that exposed the > issue. 
Our server produces quite substantial amount of garbage. > > Yours sincerely, > Nikolay Diakov > >------------------------------------------------------------------------ > >_______________________________________________ >hotspot-gc-use mailing list >hotspot-gc-use at openjdk.java.net >http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > From shaun.hennessy at alcatel-lucent.com Mon Jan 11 13:09:16 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 16:09:16 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B3ECB.5090105@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> Message-ID: <4B4B937C.4080907@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/40aaecd2/attachment.html From Jon.Masamitsu at Sun.COM Mon Jan 11 14:01:32 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 14:01:32 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B937C.4080907@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> Message-ID: <4B4B9FBC.4040103@sun.com> Shaun Hennessy wrote On 01/11/10 13:09,: > Alright I guess I am getting different behavior when I removed the > parameters (and probably went from CMS-> Throughput). > > I now see with the parameters below, and with > SurvivorRatio/MaxTenuringThreshold removed > that I in fact get a SurvivorRatio of 6, and a MaxTenuringThreshold of > 4. Both of these seem to be static > -- ie my Eden is staying at 3GB, my Survivor spaces at 0.5GB each, and > PrintTenuringDistrubtion shows > it at Max=4 - and nothing changes > > Will this always be the case (parameter won't change on the fly?) or > is there some factor that will > cause the JVM to change the parameters / I'm wondering why the > Throughput collector seems to > behave differently from CMS -- ie a different default > MaxTenuringThreshold and the fact the memory > pools seem to resize on the fly Only the througput collector has GC ergonomics implemented. That's the feature that would vary eden vs survivor sizes and the tenuring threshold. > > Also still curious if XX:ParallelGCThreads should be set to 16 > (#cpus)-- if my desire is to minimize time spent > in STW GC time? Yes 16 on average will minimize the STW GC pauses but occasionally (pretty rare actually), there can be some interaction between the GC using all the hardware threads and the OS needing one. > > thanks, > Shaun > > > > > Shaun Hennessy wrote: > >> Hi, >> Currently in java app we have the following settings applied, running >> 6u12. >> >> *-*XX:+DisableExplicitGC >> -XX:+UseConcMarkSweepGC >> -XX:+UseParmNewGC >> -XX:+CMSCompactAtFullCollection >> -XX:+CMSClassUnloadingEnabled >> -XX:+CMSInitatingOccupancyFractor=75 >> - PermSize/MaxPermSize=1GB >> - Xms/Xms=16G >> - NewSize/MaxNewSize=4GB >> -XX:+SurvivorRatio=128 >> -XX:+ MaxTenuringThreshold = 0 >> >> So we are currently not using survivor space. We are contemplating >> using them now to hopefully lessen >> the impacts of our major collections. If I simply remove these 2 >> options, restart, query the jvm via jconsole I see the following >> defaults. 
>> >> -XX:+SurvivorRatio=6 >> -XX:+MaxTenuringThreshold=15 >> >> Are these settings actually being used CMS? - When watching jcsonole >> I see that our Eden and Survivor spaces >> seem to frequently resize so I assume not or there must be another >> parameter in play. >> Will my MaxTenuringThreshold actually be 15? Can this ever change >> on the fly? >> >> One more thing -- can I believe the following as reported by jconsole >> (running 6u12, 16 cpu box) >> (I thought the formula was GCThreads=# of Cpus? ) >> >> -XX:CMSParallelRemarkEnabled=true >> -XX:UseBiasedLocking=true >> -XX:CMSConcurrentMTEnabled=true >> -XX:ParallelGCThreads=13 >> -XX:ParallelCMSThreads=4 >> >> thanks, >> Shaun >> >>------------------------------------------------------------------------ >> >>_______________________________________________ >>hotspot-gc-use mailing list >>hotspot-gc-use at openjdk.java.net >>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> > >------------------------------------------------------------------------ > >_______________________________________________ >hotspot-gc-use mailing list >hotspot-gc-use at openjdk.java.net >http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > From shaun.hennessy at alcatel-lucent.com Mon Jan 11 14:14:13 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 17:14:13 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B9FBC.4040103@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> Message-ID: <4B4BA2B5.4050407@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/69000e20/attachment.html From Jon.Masamitsu at Sun.COM Mon Jan 11 14:27:29 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 14:27:29 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4BA2B5.4050407@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> <4B4BA2B5.4050407@alcatel-lucent.com> Message-ID: <4B4BA5D1.9080503@sun.com> Shaun Hennessy wrote On 01/11/10 14:14,: > Ah thanks, I didn't realize the ergonomics applied only to througput > collector, that > answers a few more questions. Does that apply to Old:Young breakdown? If > I remove the parameters > >NewSize/MaxNewSize > >(and still using CMS) will I see the sizes of the pools get >resized as times goes on or will they remain constant? > > The sizes of the generations will change but it is according to a different policy which uses the MinHeapFreeRatio and MaxHeapFreeRatio values. product(uintx, MinHeapFreeRatio, 40, "Min percentage of heap free after GC to avoid expansion") product(uintx, MaxHeapFreeRatio, 70, "Max percentage of heap free after GC to avoid shrinking") With the default values the GC will grow a generation (bounded by it maximum) if there is less than 40% free space in the generation. Conversely, it will shrink a generation (bounded below by its minimum) if more than 70% of it is free. 
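To make the resizing policy above concrete, here is a minimal sketch in plain Java of the grow/shrink decision (an illustration only, not the HotSpot source; the class name, method name and example numbers are invented), assuming the default MinHeapFreeRatio=40 and MaxHeapFreeRatio=70:

public class FreeRatioPolicySketch {
    // After a GC: grow the generation if less than minFreePct of it is free,
    // shrink it if more than maxFreePct is free, always staying within
    // [minCapacity, maxCapacity].
    static long resize(long used, long capacity, long minCapacity, long maxCapacity,
                       int minFreePct, int maxFreePct) {
        double freePct = 100.0 * (capacity - used) / capacity;
        if (freePct < minFreePct) {
            // expand so that roughly minFreePct of the new capacity is free
            long desired = (long) (used / (1.0 - minFreePct / 100.0));
            return Math.min(desired, maxCapacity);
        } else if (freePct > maxFreePct) {
            // shrink so that roughly maxFreePct of the new capacity is free
            long desired = (long) (used / (1.0 - maxFreePct / 100.0));
            return Math.max(desired, minCapacity);
        }
        return capacity; // between the two thresholds: leave the size alone
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // A 1024 MB generation with 800 MB used after GC is only ~22% free,
        // so with the 40/70 defaults it would be grown (here to ~1333 MB).
        System.out.println(resize(800 * mb, 1024 * mb, 512 * mb, 4096 * mb, 40, 70) / mb + " MB");
    }
}

Note that with -Xms equal to -Xmx and NewSize equal to MaxNewSize (as in your current flags), the minimum and maximum capacities coincide, so this policy has no room to resize anything; that is one way to keep the generation sizes fixed under CMS.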
>thanks, >Shaun > > > > > Jon Masamitsu wrote: > >>Shaun Hennessy wrote On 01/11/10 13:09,: >> >> >> >>>Alright I guess I am getting different behavior when I removed the >>>parameters (and probably went from CMS-> Throughput). >>> >>>I now see with the parameters below, and with >>>SurvivorRatio/MaxTenuringThreshold removed >>>that I in fact get a SurvivorRatio of 6, and a MaxTenuringThreshold of >>>4. Both of these seem to be static >>>-- ie my Eden is staying at 3GB, my Survivor spaces at 0.5GB each, and >>>PrintTenuringDistrubtion shows >>>it at Max=4 - and nothing changes >>> >>> >> >> >> >>>Will this always be the case (parameter won't change on the fly?) or >>>is there some factor that will >>>cause the JVM to change the parameters / I'm wondering why the >>>Throughput collector seems to >>>behave differently from CMS -- ie a different default >>>MaxTenuringThreshold and the fact the memory >>>pools seem to resize on the fly >>> >>> >> >> >>Only the througput collector has GC ergonomics implemented. That's >>the feature that would vary eden vs survivor sizes and the tenuring >>threshold. >> >> >> >>>Also still curious if XX:ParallelGCThreads should be set to 16 >>>(#cpus)-- if my desire is to minimize time spent >>>in STW GC time? >>> >>> >> >> >>Yes 16 on average will minimize the STW GC pauses but occasionally >>(pretty rare actually), there can be some interaction between the GC using >>all the hardware threads and the OS needing one. >> >> >> >>> >>>thanks, >>>Shaun >>> >>> >>> >>> >>>Shaun Hennessy wrote: >>> >>> >>> >>>>Hi, >>>>Currently in java app we have the following settings applied, running >>>>6u12. >>>> >>>>*-*XX:+DisableExplicitGC >>>>-XX:+UseConcMarkSweepGC >>>>-XX:+UseParmNewGC >>>>-XX:+CMSCompactAtFullCollection >>>>-XX:+CMSClassUnloadingEnabled >>>>-XX:+CMSInitatingOccupancyFractor=75 >>>>- PermSize/MaxPermSize=1GB >>>>- Xms/Xms=16G >>>>- NewSize/MaxNewSize=4GB >>>>-XX:+SurvivorRatio=128 >>>> -XX:+ MaxTenuringThreshold = 0 >>>> >>>>So we are currently not using survivor space. We are contemplating >>>>using them now to hopefully lessen >>>>the impacts of our major collections. If I simply remove these 2 >>>>options, restart, query the jvm via jconsole I see the following >>>>defaults. >>>> >>>>-XX:+SurvivorRatio=6 >>>>-XX:+MaxTenuringThreshold=15 >>>> >>>>Are these settings actually being used CMS? - When watching jcsonole >>>>I see that our Eden and Survivor spaces >>>>seem to frequently resize so I assume not or there must be another >>>>parameter in play. >>>>Will my MaxTenuringThreshold actually be 15? Can this ever change >>>>on the fly? >>>> >>>>One more thing -- can I believe the following as reported by jconsole >>>>(running 6u12, 16 cpu box) >>>>(I thought the formula was GCThreads=# of Cpus? 
) >>>> >>>>-XX:CMSParallelRemarkEnabled=true >>>>-XX:UseBiasedLocking=true >>>>-XX:CMSConcurrentMTEnabled=true >>>>-XX:ParallelGCThreads=13 >>>>-XX:ParallelCMSThreads=4 >>>> >>>>thanks, >>>>Shaun >>>> >>>>------------------------------------------------------------------------ >>>> >>>>_______________________________________________ >>>>hotspot-gc-use mailing list >>>>hotspot-gc-use at openjdk.java.net >>>>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> >>>> >>>> >>>> >>>------------------------------------------------------------------------ >>> >>>_______________________________________________ >>>hotspot-gc-use mailing list >>>hotspot-gc-use at openjdk.java.net >>>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> >>> >>> >> >> >> > From Y.S.Ramakrishna at Sun.COM Mon Jan 11 14:31:11 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Mon, 11 Jan 2010 14:31:11 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B9FBC.4040103@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> Message-ID: <4B4BA6AF.5080300@sun.com> Hi Shaun -- Jon Masamitsu wrote: > > Only the througput collector has GC ergonomics implemented. That's > the feature that would vary eden vs survivor sizes and the tenuring > threshold. Just to clarify, that should read "_max_ tenuring threshold" above. The tenuring threshold itself is indeed adaptively varied from one scavenge to the next (based on survivor space size and object survival demographics, using Ungar's adaptive tenuring algorithm) by CMS and the serial collector. A different scheme determine the tenuring threshold used per scavenge by Parallel GC. Ask again if you want to know the difference between per-scavenge tenuring threshold (which is adaptively varied) and _max_ tenuring threshold (which is spec'd on the command-line). But yes the rest of the things, heap size, shape, and _max_ tenuring threshold would need to be manually tuned for optimal performance of CMS. Read the GC tuning guide for how you might tune the survivor space size and max tenuring threshold for your application using PrintTenuringDistribution data. > >> Also still curious if XX:ParallelGCThreads should be set to 16 >> (#cpus)-- if my desire is to minimize time spent >> in STW GC time? > > > Yes 16 on average will minimize the STW GC pauses but occasionally > (pretty rare actually), there can be some interaction between the GC using > all the hardware threads and the OS needing one. I have occasionally found that unless you have really large Eden sizes, fewer GC threads than CPU's often give you the best results. But yes you are in the right ball-park by growing the number of GC threads as your cpu count, cache size per cpu and heap size increase. With a 4GB young gen as you have, i'd try 8 through 16 gc threads to see what works best. -- ramki From Jon.Masamitsu at Sun.COM Mon Jan 11 19:06:10 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 19:06:10 -0800 Subject: unusually long GC latencies In-Reply-To: <69f3b8db1001111755r4f71dc1dy93e4b82508c21589@mail.gmail.com> References: <4B4B535C.50704@fredhopper.com> <4B4B593A.1030705@sun.com> <69f3b8db1001111755r4f71dc1dy93e4b82508c21589@mail.gmail.com> Message-ID: <4B4BE722.2090201@sun.com> ??? 
wrote On 01/11/10 17:55,: > We have encountered the same problem sometimes. > > Can you express what the user time, sys time and real time in detail? It is meant to be the same as the unix times(2) output. On windows the output from GetProcessTimes() is used. > > On Tue, Jan 12, 2010 at 1:00 AM, Jon Masamitsu > wrote: > > > Nikolay, > > The line > > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > > says that the user time is 0.21 (which is in line with the collections > before and after), the system time in 0.12 which is a jump up from > the before and after collections). And that the real time is 183.02 > which is the problem. I would guess that the process is waiting > for something. Might it be swapping? > > Jon > > Nikolay Diakov wrote On 01/11/10 08:35,: > > > Dear all, > > > > We have a server application, which under some load seems to do some > > unusually long garbage collections that our clients experience as a > > service denied for 2+ minutes around each moment of the long garbage > > collection. Upon further look we see in the GC log that these happen > > during young generation cleanups: > > > > GC log lines examplifying the issue > > ----------------------------------- > > server1: > > 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] > > 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 > > sys=0.00, real=0.03 secs] > > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > > sys=0.12, real=183.03 secs] > > 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] > > 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 > > sys=0.01, real=0.03 secs] > > > > server2: > > 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), > > 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] > [Times: > > user=0.37 sys=0.00, real=0.03 secs] > > 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), > > 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] > > [Times: user=0.38 sys=0.16, real=185.38 secs] > > 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), > > 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] > [Times: > > user=0.48 sys=0.03, real=0.05 secs] > > > > If interested, I have the full logs - the issue appears 4-5 > times in a > > day. > > > > We have tried tuning the GC options of our server application, > so far > > without success - we still get the same long collections. We suspect > > this happens because of the 16 core machine, because when we perform > > the same test on a 8 core machine, we do not observe the long > garbage > > collections at all. > > > > We also tried both the CMS and the ParallelGC algorithms - we > get the > > delays on the 16 core machine in both cases, and in both cases the > > server works OK on the 8 core machine. > > > > * Would you have a look what goes on? Below I post the information > > about the OS, Java version, application and HW. Attached find the GC > > logs from two servers - one running the CMS and the other running > > ParallelGC. 
> > > > OS > > -- > > Ubuntu 9.04.3 LTS > > > > uname -a > > Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 > x86_64 > > GNU/Linux > > > > HW > > -- > > POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR > 2.26GHZ > > 12GB MEMORY (6X2GB DUAL RANK) > > 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 > > PERC 6/I RAID CONTROLLER CARD 256MB PCIE > > > > Java > > ---- > > java version "1.6.0_16" > > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > > > > VM settings server1 > > ------------------- > > -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC > > -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m > > -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > > > VM settings server2 > > ------------------- > > -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m > > -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > > -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize > > =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > > > We have also observed the issue without all GC-related settings, > thus > > the VM running with defaults - same result. > > > > Application: > > ----------- > > Our application runs in a JBoss container and processes http > requests. > > We use up to 50 processing threads. We perform about 10-20 request > > processings per second during the executed test that exposed the > > issue. Our server produces quite substantial amount of garbage. > > > > Yours sincerely, > > Nikolay Diakov > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >hotspot-gc-use mailing list > >hotspot-gc-use at openjdk.java.net > > >http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > From nikolay.diakov at fredhopper.com Tue Jan 12 01:32:55 2010 From: nikolay.diakov at fredhopper.com (Nikolay Diakov) Date: Tue, 12 Jan 2010 10:32:55 +0100 Subject: unusually long GC latencies In-Reply-To: <4B4B593A.1030705@sun.com> References: <4B4B535C.50704@fredhopper.com> <4B4B593A.1030705@sun.com> Message-ID: <4B4C41C7.30406@fredhopper.com> Thanks! We will examine the servers closer for swapping events. --N On 11-01-10 18:00, Jon Masamitsu wrote: > Nikolay, > > The line > > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > > says that the user time is 0.21 (which is in line with the collections > before and after), the system time in 0.12 which is a jump up from > the before and after collections). And that the real time is 183.02 > which is the problem. I would guess that the process is waiting > for something. Might it be swapping? > > Jon > > Nikolay Diakov wrote On 01/11/10 08:35,: > > >> Dear all, >> >> We have a server application, which under some load seems to do some >> unusually long garbage collections that our clients experience as a >> service denied for 2+ minutes around each moment of the long garbage >> collection. 
Upon further look we see in the GC log that these happen >> during young generation cleanups: >> >> GC log lines examplifying the issue >> ----------------------------------- >> server1: >> 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] >> 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 >> sys=0.00, real=0.03 secs] >> 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] >> 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 >> sys=0.12, real=183.03 secs] >> 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] >> 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 >> sys=0.01, real=0.03 secs] >> >> server2: >> 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), >> 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: >> user=0.37 sys=0.00, real=0.03 secs] >> 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), >> 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] >> [Times: user=0.38 sys=0.16, real=185.38 secs] >> 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), >> 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: >> user=0.48 sys=0.03, real=0.05 secs] >> >> If interested, I have the full logs - the issue appears 4-5 times in a >> day. >> >> We have tried tuning the GC options of our server application, so far >> without success - we still get the same long collections. We suspect >> this happens because of the 16 core machine, because when we perform >> the same test on a 8 core machine, we do not observe the long garbage >> collections at all. >> >> We also tried both the CMS and the ParallelGC algorithms - we get the >> delays on the 16 core machine in both cases, and in both cases the >> server works OK on the 8 core machine. >> >> * Would you have a look what goes on? Below I post the information >> about the OS, Java version, application and HW. Attached find the GC >> logs from two servers - one running the CMS and the other running >> ParallelGC. >> >> OS >> -- >> Ubuntu 9.04.3 LTS >> >> uname -a >> Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 >> GNU/Linux >> >> HW >> -- >> POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ >> 12GB MEMORY (6X2GB DUAL RANK) >> 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 >> PERC 6/I RAID CONTROLLER CARD 256MB PCIE >> >> Java >> ---- >> java version "1.6.0_16" >> Java(TM) SE Runtime Environment (build 1.6.0_16-b01) >> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) >> >> VM settings server1 >> ------------------- >> -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC >> -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m >> -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC >> -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails >> >> VM settings server2 >> ------------------- >> -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m >> -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize >> =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC >> -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails >> >> We have also observed the issue without all GC-related settings, thus >> the VM running with defaults - same result. >> >> Application: >> ----------- >> Our application runs in a JBoss container and processes http requests. >> We use up to 50 processing threads. 
We perform about 10-20 request >> processings per second during the executed test that exposed the >> issue. Our server produces quite substantial amount of garbage. >> >> Yours sincerely, >> Nikolay Diakov >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> > From shaun.hennessy at alcatel-lucent.com Fri Jan 15 12:02:07 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Fri, 15 Jan 2010 15:02:07 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4BA6AF.5080300@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> <4B4BA6AF.5080300@sun.com> Message-ID: <4B50C9BF.8060202@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100115/1ca2d141/attachment.html From Y.S.Ramakrishna at Sun.COM Fri Jan 15 13:31:01 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Fri, 15 Jan 2010 13:31:01 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B50C9BF.8060202@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> <4B4BA6AF.5080300@sun.com> <4B50C9BF.8060202@alcatel-lucent.com> Message-ID: <4B50DE95.4060901@Sun.COM> Hi Shaun -- On 01/15/10 12:02, Shaun Hennessy wrote: > Yes thanks I see the difference with the PrintTenuringDistribution -- > the _max_ is 4 while the _actual_ threshold varies from 1-4. > > > I'd like to make sure I've on the right thinking here.... > here are 2 runs using no-survivor or survivor > (everything else the same, loads should be pretty close, but not identical) > > 1) No Survivor + with a code fix ; uptime 3:31 (211min) > MINOR 322 collections; 3m52s (232 seconds) > (91.56 collections/hour) (0.72 seconds/collection > = 65.9sec/hour on minor's = 1.8% of total time > MAJOR 18 collections; 1m5s > (5.12 collections/hour) (3.61 seconds/collection) > = 18.48sec/hour = 0.51% of total time > > 2) Survivor + a code fix; uptime 3h14 (194min) > MINOR 346 collections; 3m40s (220 seconds) > (107.01 collections/hour) (1.57 seconds/collection) > = 167.99sec/hour = 4.66% of total time > MAJOR (aka Major) 8 collections; 16.7 sec > (2.47 collections/hour) (2.08 seconds/collection) > = 5.12 sec/hour = 0.14% of total time > > So by using survivor spaces we're reduced the frequency and duration of > Major collections which was our goal, > and is the general expected result when going from no-survivor to > survivor as we'll be kicking less up > to the tenured space. Additionally we've increased the frequency of our > minor collections (again as expected > as our Eden shrunk from 4GB to 3GB (with 1GB for survivor), and our > duration has also increased because > now we're doing more copying around between survivor spaces -- again > everything is as expected. > Everything I've said so far correct? Correct. > > > Throwing in one more scenario, I now remove the code fix. 
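For the record, the per-hour figures above are just uptime arithmetic, and a small throwaway helper makes the comparison between runs less error-prone (this snippet is invented here purely for illustration; it recomputes scenario 1's numbers from uptime, collection count and total GC time):

public class GcOverheadSketch {
    static void report(String label, double uptimeMin, int collections, double gcSeconds) {
        double hours = uptimeMin / 60.0;
        System.out.printf("%s: %.2f coll/hour, %.2f s/coll, %.2f s/hour, %.2f%% of total time%n",
                label,
                collections / hours,                  // collections per hour
                gcSeconds / collections,              // average pause
                gcSeconds / hours,                    // GC seconds per hour of uptime
                100.0 * gcSeconds / (uptimeMin * 60.0));
    }

    public static void main(String[] args) {
        // Scenario 1: 211 min uptime, 322 minor collections totalling 232 s,
        // and 18 major collections totalling 65 s (1m5s).
        report("minor, no survivor", 211, 322, 232.0);
        report("major, no survivor", 211, 18, 65.0);
    }
}

(Plugging scenario 2 into the same helper is a useful sanity check: 220 s over 346 collections is about 0.64 s per collection rather than 1.57, so one of the quoted figures for that run is probably a transcription slip and worth recomputing from the raw log.)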
The code fix > was improving a method > so it no longer allocated memory every invocation by instead re-using a > static threadlocal memory. > This was one of top methods allocating useless memory, memory that I > expect would have died very quickly. > Now we have the following, again all parameters are the same, just > removed the fix, > loads should be pretty close, but not exact > > 3) Survivor, *NO Code Fix*; uptime 3h28min (208min) > MINOR 432collections, 4m17s (257 second) > *(123.61 collections/hour) (0.59 seconds/collection* > = 72.92sec/hour on minor's = 2.02% of total time > > MAJOR 12 collections, 25.387s > (3.46 collections/hour) (2.11 seconds/collection) > = 7.32 sec/hour = 0.20 % of total time > > So comparing 2) and 3) > Alright so now I'm having more minor collections without the code fix > compared to with the fix, - which would be expected as > we aren't allocating every time we hit the method. BUT - these minor > collections are much quicker without code-fix than with the > code fix -- presumably because this results in a greater % of objects > that are still living each collection that must > be copied around a few times? Possibly, depending on the lifetime of these objects. > > It's nice that the fix has resulted in less frequent major's - I am > guessing it's because by filling up the young generation > less quickly means more time between collections which means more > objects have a chance to die -- but it seems > to be quite the hit on the minor collection time to achieve this. One is basically trading off copying between survivor spaces versus the cost of dealing with the garbage in the old generation (or of premature tenuring causing other objects to stick around as floating garbage). > > I've written a d-trace script which tracks all our method's memory > allocation and we were going to address > some of the top hitters as we did on the first one. I guess my ultimate > question is if we start eliminating some of the > low-hanging fruit memory allocating, most of which it is likely to be > that which would have died quickly --- how should we be tuning? > Do we need to "lower" our tenuring threshold or possibly go back to not > using Survivor space? Usually, getting rid of short-lived object allocation is unlikely to provide big benefits because those are easiest to collect with a copying collector. Reducing long- and medium-lived objects might provide greater benefits. > It almost seems like we may have to choose to use survivor space OR we > can try to stop allocating as much memory > -- but trying do both may be counter productive / make minor collection > times (and more importantly throughput of the application) unacceptable? Not really. Allocating fewer objects will always be superior to allocating more, no matter their lifetimes. But when you do that, adjusting your tenuring threshold so as not to cause useless copying of medium-lived objects between the survivor spaces is important, especially if there is not much pressure on the old generation collections. > The original goal was to eliminate long major pauses, but we can't > completely ignore throughput.... I do not see this as the kind of choice you describe above. Rather it comes down to setting a suitable tenuring threshold. It is true though that if you have eliminated almost all medium-lived objects, then setting MaxTenuringThreshold=1 will give the best performance. 
(In most cases I have seen, completely doing away with survivor spaces and using MaxTenuringThreshold=0 does not seem to work as well.) +PrintTenuringDistribution should let you find the "knee" of the curve which will tell you what your optimal MaxTenuringDistribution would be (i.e. beyond which the copying between survivor spaces yileds no benefits). cheers. -- ramki > > > thanks, > Shaun > > Y. Srinivas Ramakrishna wrote: >> Hi Shaun -- >> >> Jon Masamitsu wrote: >>> >>> Only the througput collector has GC ergonomics implemented. That's >>> the feature that would vary eden vs survivor sizes and the tenuring >>> threshold. >> >> Just to clarify, that should read "_max_ tenuring threshold" above. >> The tenuring threshold itself is indeed adaptively varied from >> one scavenge to the next (based on survivor space size and object >> survival demographics, using Ungar's adaptive tenuring algorithm) >> by CMS and the serial collector. A different scheme determine the >> tenuring threshold used per scavenge by Parallel GC. >> Ask again if you want to know the difference between per-scavenge >> tenuring threshold (which is adaptively varied) and >> _max_ tenuring threshold (which is spec'd on the command-line). >> >> But yes the rest of the things, heap size, shape, and _max_ tenuring >> threshold >> would need to be manually tuned for optimal performance of CMS. Read >> the GC >> tuning guide for how you might tune the survivor space size and >> max tenuring threshold for your application using >> PrintTenuringDistribution data. >> >>> >>>> Also still curious if XX:ParallelGCThreads should be set to 16 >>>> (#cpus)-- if my desire is to minimize time spent >>>> in STW GC time? >>> >>> >>> Yes 16 on average will minimize the STW GC pauses but occasionally >>> (pretty rare actually), there can be some interaction between the GC >>> using >>> all the hardware threads and the OS needing one. >> >> I have occasionally found that unless you have really large Eden >> sizes, fewer >> GC threads than CPU's often give you the best results. But yes you are >> in the >> right ball-park by growing the number of GC threads as your cpu count, >> cache size per cpu >> and heap size increase. With a 4GB young gen as you have, i'd try 8 >> through 16 >> gc threads to see what works best. >> >> -- ramki > From chkwok at digibites.nl Sat Jan 16 17:06:44 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Sun, 17 Jan 2010 02:06:44 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> Message-ID: <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> Hmm, where to begin... Today, I started tracing a weird, periodic performance problem with a jetty based server. The app uses quite a bit of memory, so it's running with a CMS collector and a 16G heap. We've noticed some weird, random 20-30 seconds latency, which occurs about once an hour, so I've added some diagnostics and most importantly, a way to call Thread. getAllStackTraces() remotely - and captured a live trace of the 'bug'. So I've got almost every thread (~60) on the system blocked on ? ? ? at java.nio.Bits.copyToByteArray(Bits.java:?) ? ? ? at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) ? ? ? at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191) ? ? ? at sun.nio.ch.IOUtil.read(IOUtil.java:209) ? ? ? at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) ? ? ? 
at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

And a few other threads are blocked via different paths on copyTo/FromByteArray. What's happening?!

Looking through the code, tracing from copyToByteArray to jni_GetPrimitiveArrayCritical, GC_locker::lock_critical(thread) in jni.cpp:2546 looks like the cause of all evil. But the bug doesn't trigger on every GC, so this alone won't cause everything to get stuck. We have a minor collection every few seconds and a major one every minute, but the problem is rarer: it occurs about once per hour, and can be absent for up to half a day when you're lucky. It does have the annoying tendency to occur when the server is unusually busy.

To understand what is happening, I opened gcLocker.cpp/hpp/.inline.hpp and tried to find the exact conditions under which this can happen. lock_critical() has two paths - a fast path when there's enough memory and needs_gc() returns false, or a slow path which *blocks the thread* on JNICritical_lock->wait() if needs_gc() or _doing_gc is true. unlock_critical() works the same way: if needs_gc() is set, it calls the _slow() version.

All the puzzle pieces are here now; all I needed to do was bring them together in a scenario:

1. A thread enters a critical section, calls lock_critical(), uses the fast path.
2. needs_gc() goes from false to true, because old generation utilization is above the collection threshold. The CMS background collection won't start yet, because GC_Locker's _jni_lock_count is > 0.*
3. The thread exits its critical section. needs_gc() is true, so it uses jni_unlock_slow().
4. In jni_unlock_slow(), this is the last thread out. That sets _doing_gc to true and blocks all future attempts on lock_critical() until that flag is cleared.**
5. Do a full collection by calling Universe::heap()->collect(GCCause::_gc_locker). This takes "forever".***
6. Set _doing_gc to false again via clear_needs_gc().
7. Rejoice! JNICritical_lock->notify_all() is called! And watch the load go to 30+ because the server has been doing nothing for ages.

*: speculation. I didn't actually find the line calling set_needs_gc in vm/gc_implementation/*, but I assume it's set from there, somewhere.
**: I've seen a few threads that were blocked on a copyToByteArray in a MappedByteBuffer that has been .load()-ed, so it must be a lock. A simple memcpy cannot take this long.
***: speculation. I didn't actually measure the time, but because every single thread in the system is trapped in JNICritical_lock->wait(), I assume this part takes forever.
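A rough sketch of a stand-alone trigger for this scenario - untested, with an invented class name, thread counts, buffer and heap sizes that would need tuning, and no guarantee the old gen actually crosses the CMS trigger while readers hold critical sections - could look like this:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Untested sketch. Reader threads do bulk get() from direct buffers into heap
// arrays, which the stack traces above show going through Bits.copyToByteArray,
// i.e. a JNI critical section. A churn thread keeps promoting data so the old
// generation repeatedly approaches the CMS initiating occupancy, and a heartbeat
// thread reports unusually long gaps, which is where a GC_locker-induced
// stop-the-world collection would show up.
// Possible flags for the experiment: -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC
// -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+PrintGCDetails
// -XX:+PrintGCTimeStamps
public class GcLockerStallSketch {
    public static void main(String[] args) throws Exception {
        // 1. Reader threads: copy from a direct buffer into a heap byte[] in a loop.
        for (int i = 0; i < 32; i++) {
            Thread reader = new Thread(new Runnable() {
                public void run() {
                    ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);
                    byte[] heap = new byte[1 << 20];
                    while (true) {
                        direct.clear();
                        direct.get(heap);   // bulk copy -> Bits.copyToByteArray
                    }
                }
            });
            reader.setDaemon(true);
            reader.start();
        }

        // 2. Churn thread: a rolling window of arrays that live long enough to be
        //    promoted, producing old-gen occupancy plus a steady stream of garbage.
        Thread churn = new Thread(new Runnable() {
            public void run() {
                List<byte[]> window = new ArrayList<byte[]>();
                int slot = 0;
                while (true) {
                    byte[] chunk = new byte[64 * 1024];
                    if (window.size() < 4096) {      // roughly 256 MB kept live
                        window.add(chunk);
                    } else {
                        window.set(slot, chunk);     // replace the oldest entry
                        slot = (slot + 1) % window.size();
                    }
                }
            }
        });
        churn.setDaemon(true);
        churn.start();

        // 3. Heartbeat: if a 10 ms sleep takes much longer, something stopped the
        //    world (or starved this thread) for that long.
        long last = System.currentTimeMillis();
        while (true) {
            Thread.sleep(10);
            long now = System.currentTimeMillis();
            if (now - last > 1000) {
                System.out.println("stall of " + (now - last) + " ms");
            }
            last = now;
        }
    }
}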
My workaround now is just to stop using a SelectChannelConnector and switch back to a ServerSocketConnector in Jetty, praying that the other nio calls I have in the code (MappedByteBuffer.read() is semi-frequently used) won't turn CMS into a very slow, almost serial stop the world collector. I kinda doubt that this is intended, so, should I file this as a bug? I've (mis)filed it already as #1695140, but the analysis is wrong there, blaming it on "some kind of spinlock?", and as I thought it was unrelated to gc, no jvm arguments were attached. Well, it's a lock, but it's quite a bit more complex than I assumed. Any help would be appreciated. I've got no experience at all in hacking jvm's, but I can try to create a test case to trigger this situation manually and try different builds of the jvm. Attached is the full stack trace of all threads during the 'lag spike'. Note that the stack dump is triggered over http, using java.io / streams; that one kept working while all requests on a channel were blocked. "InstrumentedConnector" is just simple a subclass of the default jetty SelectChannelConnector, with extra stats to help figuring out what's happening. Chi Ho Kwok -------------- next part -------------- Format is: list of threads with the same stack trace, followed by the actual stack trace 2010-01-16 18:34:03.102240 - 7.39908218384 ms {'priority': 1, 'state': 'RUNNABLE', 'name': '1468041408 at qtp-1986936160-28'} {'priority': 1, 'state': 'RUNNABLE', 'name': '962687369 at qtp-1986936160-30'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1506392522 at qtp-1986936160-52'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1728081269 at qtp-1986936160-3'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1352230939 at qtp-1986936160-39'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1663906309 at qtp-1986936160-40'} {'priority': 1, 'state': 'RUNNABLE', 'name': '411509632 at qtp-1986936160-31'} {'priority': 1, 'state': 'RUNNABLE', 'name': '148376547 at qtp-1986936160-2'} {'priority': 1, 'state': 'RUNNABLE', 'name': '413536612 at qtp-1986936160-17'} {'priority': 1, 'state': 'RUNNABLE', 'name': '761700907 at qtp-1986936160-44'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1222033772 at qtp-1986936160-6'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1361235382 at qtp-1986936160-13'} {'priority': 1, 'state': 'RUNNABLE', 'name': '405642246 at qtp-1986936160-34'} {'priority': 1, 'state': 'RUNNABLE', 'name': '623839641 at qtp-1986936160-14'} {'priority': 1, 'state': 'RUNNABLE', 'name': '870010735 at qtp-1986936160-7'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1027818036 at qtp-1986936160-1'} {'priority': 1, 'state': 'RUNNABLE', 'name': '981863753 at qtp-1986936160-24'} {'priority': 1, 'state': 'RUNNABLE', 'name': '555551311 at qtp-1986936160-18'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1526644999 at qtp-1986936160-29'} {'priority': 1, 'state': 'RUNNABLE', 'name': '267273800 at qtp-1986936160-37'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1476473244 at qtp-1986936160-58'} {'priority': 1, 'state': 'RUNNABLE', 'name': '172647384 at qtp-1986936160-20'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1357862146 at qtp-1986936160-10'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2773808 at qtp-1986936160-5'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2030814365 at qtp-1986936160-51'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1838022392 at qtp-1986936160-4'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1551156138 at qtp-1986936160-23'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1169983611 at 
qtp-1986936160-43'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2018812712 at qtp-1986936160-56'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1209719856 at qtp-1986936160-41'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1032121412 at qtp-1986936160-38'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1813126941 at qtp-1986936160-26'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1144967167 at qtp-1986936160-15'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1765421434 at qtp-1986936160-47'} {'priority': 1, 'state': 'RUNNABLE', 'name': '43086831 at qtp-1986936160-21'} {'priority': 1, 'state': 'RUNNABLE', 'name': '703447155 at qtp-1986936160-25'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1382393727 at qtp-1986936160-16'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1393665909 at qtp-1986936160-19'} {'priority': 1, 'state': 'RUNNABLE', 'name': '618846953 at qtp-1986936160-11'} {'priority': 1, 'state': 'RUNNABLE', 'name': '821556544 at qtp-1986936160-9'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2026789660 at qtp-1986936160-54'} {'priority': 1, 'state': 'RUNNABLE', 'name': '281555666 at qtp-1986936160-42'} {'priority': 1, 'state': 'RUNNABLE', 'name': '9814147 at qtp-1986936160-33'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1295986757 at qtp-1986936160-0'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1221696456 at qtp-1986936160-48'} {'priority': 1, 'state': 'RUNNABLE', 'name': '303731508 at qtp-1986936160-61'} {'priority': 1, 'state': 'RUNNABLE', 'name': '949026880 at qtp-1986936160-27'} {'priority': 1, 'state': 'RUNNABLE', 'name': '961725657 at qtp-1986936160-22'} {'priority': 1, 'state': 'RUNNABLE', 'name': '900409598 at qtp-1986936160-53'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1145518399 at qtp-1986936160-45'} {'priority': 1, 'state': 'RUNNABLE', 'name': '305035296 at qtp-1986936160-60'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1594958326 at qtp-1986936160-8'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1752918153 at qtp-1986936160-35'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1034573819 at qtp-1986936160-46'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1377877625 at qtp-1986936160-36'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1535043768 at qtp-1986936160-50'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1702714666 at qtp-1986936160-32'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1649966228 at qtp-1986936160-59'} {'priority': 1, 'state': 'RUNNABLE', 'name': '852031224 at qtp-1986936160-12'} at java.nio.Bits.copyToByteArray(Bits.java:?) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191) at sun.nio.ch.IOUtil.read(IOUtil.java:209) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'Timer-0'} {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'Timer-1'} at java.lang.Object.wait(Object.java:?) 
at java.util.TimerThread.mainLoop(Timer.java:509) at java.util.TimerThread.run(Timer.java:462) {'priority': 5, 'state': 'RUNNABLE', 'name': '1720122918 at qtp-1986936160-66 - Acceptor0 InstrumentedConnector at 0.0.0.0:1247'} {'priority': 5, 'state': 'RUNNABLE', 'name': '1833701635 at qtp-1986936160-64 - Acceptor0 InstrumentedConnector at 0.0.0.0:1249'} {'priority': 5, 'state': 'RUNNABLE', 'name': '569616903 at qtp-1986936160-68 - Acceptor0 InstrumentedConnector at 0.0.0.0:1245'} {'priority': 5, 'state': 'RUNNABLE', 'name': '2078955121 at qtp-1986936160-67 - Acceptor0 InstrumentedConnector at 0.0.0.0:1246'} {'priority': 5, 'state': 'RUNNABLE', 'name': '391717236 at qtp-1986936160-65 - Acceptor0 InstrumentedConnector at 0.0.0.0:1248'} at sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:?) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:459) at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192) at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '1949571424 at qtp-1986936160-55'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1743299062 at qtp-1986936160-57'} at java.nio.Bits.copyToByteArray(Bits.java:?) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) at com.wol3.server.model.loader.Loader.getUnzippedByteBuffer(Loader.java:174) at com.wol3.server.model.loader.Loader.load0(Loader.java:90) at com.wol3.server.model.loader.Loader.load(Loader.java:68) at com.wol3.server.model.loader.Loader.load(Loader.java:197) at com.wol3.server.model.loader.CachedLoader.getCombatLog(CachedLoader.java:95) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:46) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 5, 'state': 'WAITING', 'name': 'main'} at java.lang.Object.wait(Object.java:?) 
at java.lang.Object.wait(Object.java:485) at org.mortbay.thread.QueuedThreadPool.join(QueuedThreadPool.java:298) at org.mortbay.jetty.Server.join(Server.java:332) at com.wol3.server.http.HttpServer.join(HttpServer.java:140) at com.wol3.server.http.HttpServer.main(HttpServer.java:146) {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'CachedLoader-GC2-daemon'} at java.lang.Thread.sleep(Thread.java:?) at com.wol3.server.model.loader.CachedLoader$1.run(CachedLoader.java:35) at java.lang.Thread.run(Thread.java:619) {'priority': 10, 'state': 'WAITING', 'name': 'Reference Handler'} at java.lang.Object.wait(Object.java:?) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) {'priority': 9, 'state': 'RUNNABLE', 'name': 'Signal Dispatcher'} {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'Thread-74'} at java.lang.Thread.sleep(Thread.java:?) at com.wol3.server.model.loader.CachedLoader$2.run(CachedLoader.java:59) at java.lang.Thread.run(Thread.java:619) {'priority': 8, 'state': 'WAITING', 'name': 'Finalizer'} at java.lang.Object.wait(Object.java:?) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) {'priority': 10, 'state': 'RUNNABLE', 'name': '808460461 at qtp-274617771-0'} at java.lang.Thread.dumpThreads(Thread.java:?) at java.lang.Thread.getAllStackTraces(Thread.java:1487) at com.wol3.server.http.handler.SystemServlet.doGet(SystemServlet.java:88) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:915) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '458505352 at qtp-1986936160-63'} at java.nio.Bits.copyToByteArray(Bits.java:?) 
at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) at com.wol3.server.data.ChunkedBCLReader$2.read(ChunkedBCLReader.java:119) at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:221) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141) at java.io.FilterInputStream.read(FilterInputStream.java:90) at com.wol3.server.data.ChunkedBCLReader.inflateBuffer(ChunkedBCLReader.java:100) at com.wol3.server.data.ChunkedBCLReader.readFrom(ChunkedBCLReader.java:66) at com.wol3.server.model.loader.Loader.load0(Loader.java:96) at com.wol3.server.model.loader.Loader.load(Loader.java:68) at com.wol3.server.model.loader.Loader.load(Loader.java:197) at com.wol3.server.model.loader.CachedLoader.getCombatLog(CachedLoader.java:95) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:46) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '1298336441 at qtp-1986936160-49'} at java.nio.Bits.copyFromByteArray(Bits.java:?) 
at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:314) at org.mortbay.io.nio.DirectNIOBuffer.poke(DirectNIOBuffer.java:201) at org.mortbay.io.nio.DirectNIOBuffer.poke(DirectNIOBuffer.java:141) at org.mortbay.io.AbstractBuffer.put(AbstractBuffer.java:448) at org.mortbay.jetty.HttpGenerator.addContent(HttpGenerator.java:148) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:644) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:579) at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:109) at org.mortbay.jetty.AbstractGenerator$OutputWriter.write(AbstractGenerator.java:903) at java.io.PrintWriter.write(PrintWriter.java:382) at com.wol3.server.http.handler.DataAPIHandler.writeResponse(DataAPIHandler.java:680) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:70) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 10, 'state': 'RUNNABLE', 'name': '1806344089 at qtp-274617771-1 - Acceptor0 SocketConnector at 0.0.0.0:1250'} at java.net.PlainSocketImpl.socketAccept(PlainSocketImpl.java:?) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '1614281502 at qtp-1986936160-62'} at java.nio.Bits.copyFromByteArray(Bits.java:?) 
at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:314) at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:290) at sun.nio.ch.IOUtil.write(IOUtil.java:70) at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:203) at com.wol3.server.model.postprocessing.ChartDataExporter.exportDefault(ChartDataExporter.java:291) at com.wol3.server.model.loader.Loader.load0(Loader.java:126) at com.wol3.server.model.loader.Loader.load(Loader.java:68) at com.wol3.server.model.loader.Loader.load(Loader.java:197) at com.wol3.server.model.loader.CachedLoader.getCombatLog(CachedLoader.java:95) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:46) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) From Y.S.Ramakrishna at Sun.COM Sun Jan 17 00:45:22 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Sun, 17 Jan 2010 00:45:22 -0800 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> Message-ID: <4B52CE22.1050703@sun.com> Hi Chi Ho -- What's the version of the JDK you are using? Could you share a GC log using the JVM options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, a list of the full set of JVM options you used, and the periods when you see the long stalls. The intent of the design is that full gc's be avoided as much as possible and certainly not stop-world full gc's when using CMS except under very special circumstances. Perhaps looking at the shape of yr heap and its occupancy when you see the stop-world full gc happen that causes the long stall will allow us to see if you are occasionally running into that situation, and suggest a suitable workaround, or a fix within the JVM. That having been said, we know of a different problem that can, with the current design cause what appear to be stalls (but are really slow-path allocations which can be quite slow) when Eden has been exhausted and JNI critical sections are held. There may be ways of avoiding that situation by a suitable small tweak of the current design for managing the interaction of JNI critical sections and GC. Anyway, a test case and the info I requested above should help us diagnose the actual problem you are running into, speedily, and thence find a workaround or fix. 
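(Editorial illustration, not code from any attachment in this thread: every blocked worker in the dump above bottoms out in java.nio.Bits.copyToByteArray or copyFromByteArray, i.e. a bulk get/put between a direct ByteBuffer and a heap byte[]. A minimal, hypothetical sketch of that call pattern, with invented class name and sizes, is:)

import java.nio.ByteBuffer;

// Minimal sketch of the kind of call the blocked stacks show. On the 6uXX VMs
// discussed in this thread the bulk copy runs inside a single JNI critical
// section (GetPrimitiveArrayCritical), which is why it can stall when the GC
// locker decides a collection is needed.
public class DirectCopySketch {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20); // 1 MB off-heap buffer
        byte[] onHeap = new byte[1 << 20];

        direct.put(onHeap);   // heap -> native copy (Bits.copyFromByteArray)
        direct.flip();
        direct.get(onHeap);   // native -> heap copy (Bits.copyToByteArray)
    }
}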
-- ramki Chi Ho Kwok wrote: > Hmm, where to begin... > > Today, I started tracing a weird, periodic performance problem with a > jetty based server. The app uses quite a bit of memory, so it's > running with a CMS collector and a 16G heap. We've noticed some weird, > random 20-30 seconds latency, which occurs about once an hour, so I've > added some diagnostics and most importantly, a way to call Thread. > getAllStackTraces() remotely - and captured a live trace of the 'bug'. > > So I've got almost every thread (~60) on the system blocked on > at java.nio.Bits.copyToByteArray(Bits.java:?) > at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) > at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191) > at sun.nio.ch.IOUtil.read(IOUtil.java:209) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) > at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290) > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) > at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) > at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) > > And a few other threads are blocked via different paths on copyTo/FromByteArray. > > What's happening?! > > Looking through the code, tracing from copyToByteArray to > jni_GetPrimitiveArrayCritical, GC_locker::lock_critical(thread) in > jni.cpp:2546 looks like the cause of all evil. But the bug doesn't > trigger on every GC, nope, so this alone won't cause everything to get > stuck. We have a minor collection every few seconds and a major one > every minute, but the problem is more rare, it occurs about once per > hour, but can be absent for up to half a day when you're lucky. But it > has the annoying tendency to occur when it's unusually busy. > > To understand what the heck is happening, I've opened > gcLocker.cpp/hpp/.inline.hpp, and tried to find the exact condition > how this can happen. lock_critical() has two paths - a fast path when > there's enough memory and needs_gc() returns false, or a slow path > which *blocks the thread* on JNICritical_lock->wait() if needs_gc() or > _doing_gc is true. It's the same for unlock_critical, if needs_gc() is > set, it calls the _slow() version. > > All the puzzle pieces are here now, all I needed to do is bring them > all together in a scenario: > > 1. A thread enters a critical section, calls lock_critical(), uses the fast path > 2. needs_gc() goes from false to true, because old generation > utilization is above the collection threshold. CMS background > collection won't start, yet, because GC_Locker's _jni_lock_count is > > 0.* > 3. Thread exits critical section. needs_gc() is true, so it uses > jni_unlock_slow() > 4. In jni_unlock_slow(), this is the last thread out. That will set > _doing_gc to true, and block all future attempts on lock_critical() > until that flag is cleared ** > 5. Do a full collection by calling > Universe::heap()->collect(GCCause::_gc_locker). This take "forever" > *** > 6. Set _doing_gc to false again via clear_needs_gc() > 7. Rejoice! JNICritical_lock->notify_all() is called! And watch the > load go to 30+ because the server has been doing nothing for ages. > > *: speculation. Didn't actually find the line calling set_needs_gc in > vm/gc_implementation/*, but I assume it's set from there, somewhere. 
> **: I've seen a few threads that were blocked on a copyToByteArray in > a MappedByteBuffer that has been .load()-ed, so it must be a lock. A > simple memcpy cannot take this long. > ***: speculation. Didn't actually measure the time, but because every > single thread in the system is trapped in JNICritical_lock->wait(), I > assume this part takes forever. > > > So the conclusion is: using a channel to handle HTTP requests combined > with a large heap is suicidal at the moment. Every channel.read() call > will block when GC_Locker initiates a CMS collection, blocking every > single request until the foreground collection is done. That's because > SocketChannelImpl.read() calls copyToByteArray internally, and it will > block during a GC_Locker initiated CMS-collection. > > > I'm just a lowly java programmer, so I've no idea how to fix this. My > workaround now is just to stop using a SelectChannelConnector and > switch back to a ServerSocketConnector in Jetty, praying that the > other nio calls I have in the code (MappedByteBuffer.read() is > semi-frequently used) won't turn CMS into a very slow, almost serial > stop the world collector. > > > I kinda doubt that this is intended, so, should I file this as a bug? > I've (mis)filed it already as #1695140, but the analysis is wrong > there, blaming it on "some kind of spinlock?", and as I thought it was > unrelated to gc, no jvm arguments were attached. Well, it's a lock, > but it's quite a bit more complex than I assumed. > > > Any help would be appreciated. I've got no experience at all in > hacking jvm's, but I can try to create a test case to trigger this > situation manually and try different builds of the jvm. > > > Attached is the full stack trace of all threads during the 'lag > spike'. Note that the stack dump is triggered over http, using java.io > / streams; that one kept working while all requests on a channel were > blocked. "InstrumentedConnector" is just simple a subclass of the > default jetty SelectChannelConnector, with extra stats to help > figuring out what's happening. > > > Chi Ho Kwok > > > ------------------------------------------------------------------------ > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From chkwok at digibites.nl Sun Jan 17 05:22:18 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Sun, 17 Jan 2010 14:22:18 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <4B52CE22.1050703@sun.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> Message-ID: <1b9d6f691001170522n320b7d9fp9c91e60388831efe@mail.gmail.com> Hi Ramki, On Sun, Jan 17, 2010 at 9:45 AM, Y. Srinivas Ramakrishna wrote: > Hi Chi Ho -- > > What's the version of the JDK you are using? Could you share a > GC log using the JVM options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, > a list of the full set of JVM options you used, and the periods when you see > the > long stalls. The intent of the design is that full gc's be avoided > as much as possible and certainly not stop-world full gc's when using > CMS except under very special circumstances. 
Perhaps looking at the > shape of yr heap and its occupancy when you see the stop-world > full gc happen that causes the long stall will allow us to see if > you are occasionally running into that situation, and suggest > a suitable workaround, or a fix within the JVM. JDK 6u17 at the moment. Full options are: -ea -server -Xms16G -Xmx16G -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:MaxNewSize=768m -XX:NewSize=768m -Xloggc:/var/log/wol-server/gc.log -XX:+PrintGCDetails -XX:+PrintGC -XX:CMSInitiatingOccupancyFraction=78 -XX:SurvivorRatio=2 -XX:+ExplicitGCInvokesConcurrent We've been running with -XX:+PrintGCDetails -XX:+PrintGC -XX:+ExplicitGCInvokesConcurrent and logging it to a file, but sadly, I didn't think of saving it - every time the service is restarted, the log is overwritten. But I've checked - there are no concurrent mode failures (adjusted CMS threshold down every time that happened in the past), just "weird" stalls without a cause. Just found and added PrintGCTimeStamps now, with only times relative from the start it's hard to find a specific time, especially when the time goes 6-digits... There's even a "watchdog" thread that guards against this, basically, it's a thread that wakes up every 200ms and reports when it's been stalled, so any real stop the world collections will show up in the error log. Since the switch from nio to normal sockets, the nagios log has been clean. It used to be full of 1 line high response time service alerts and followed by an OK in 0.0001s message. > That having been said, we know of a different problem that can, with > the current design cause what appear to be stalls (but are really > slow-path allocations which can be quite slow) when Eden has been > exhausted and JNI critical sections are held. There may be ways of > avoiding that situation by a suitable small tweak of the current > design for managing the interaction of JNI critical sections > and GC. > > Anyway, a test case and the info I requested above should help us > diagnose the actual problem you are running into, speedily, and > thence find a workaround or fix. I'll try my best to make a small test case. Seems like copying into / out of direct buffers will do a lock_critical, so I'll just have multiple threads just doing that constantly, while holding like 2G of data in the old gen, one thread allocating random stuff and holding onto it to simulate garbage production, and report when all buffer copying threads are stalled for longer than $number ms per buffer.get(). I'll have some time later today to try this. The app is kinda weird in how it uses memory - it loads many large datasets (up to 50M ram each) into the heap, keeps it there and LRU 20% out when the old gen occupancy is getting near the the CMS threshold, we use the MXBeans to measure it. While loading the data set, it generates another 50M of transient garbage which shouldn't leave eden. It's the number-crunching service powering worldoflogs.com. Chi Ho Kwok From chkwok at digibites.nl Sun Jan 17 07:29:57 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Sun, 17 Jan 2010 16:29:57 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <4B52CE22.1050703@sun.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> Message-ID: <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> Hi Ramki, On Sun, Jan 17, 2010 at 9:45 AM, Y. 
Srinivas Ramakrishna wrote: > Hi Chi Ho -- > > Anyway, a test case and the info I requested above should help us > diagnose the actual problem you are running into, speedily, and > thence find a workaround or fix. I've managed to create a test case. Sorry for the size tho, it's a tricky problem to reproduce, so the test case it kinda large at 280 lines. The basic idea is: two threads are running concurrently, putting and getting data out of a direct byte buffer. We know that it will invoke a lock_critical. There's also a memory consumer thread to keep CMS busy, but it's throttled, sleeps between simulated file loads for 100ms. The last thread will monitor the rest, and warn when something unusual happens. On my system, this will normally produce lines like "Mem: 0, executed 516 direct buffer operations, 2 load operations."; or good for copying about 2GB/second. But shortly after starting this stress test, the stats thread will start writing to System.err; when the amount of direct buffer ops is zero in a period, it will mark the current time to remember when the 'stop-the-world-for-nio' phase started, or print how long it has blocked. This generates logs like: Situation normal: Mem: 109, executed 643 direct buffer operations, 2 load operations. 11.780: [GC 11.780: [ParNew: 196608K->65536K(196608K), 0.1822832 secs] 399349K->401204K(2031616K), 0.1823253 secs] [Times: user=0.28 sys=0.05, real=0.18 secs] Mem: 183, executed 335 direct buffer operations, 1 load operations. 11.963: [GC [1 CMS-initial-mark: 335668K(1835008K)] 403551K(2031616K), 0.0126970 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 11.976: [CMS-concurrent-mark-start] 12.127: [CMS-concurrent-mark: 0.151/0.151 secs] [Times: user=0.16 sys=0.00, real=0.15 secs] 12.127: [CMS-concurrent-preclean-start] 12.132: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 12.132: [CMS-concurrent-abortable-preclean-start] Blockage started: Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked, started block timer ~~* 0 direct buffer operations, 3 load operations. Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 250ms ~~ 0 direct buffer operations, 2 load operations. Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 500ms ~~ 0 direct buffer operations, 3 load operations. Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 750ms ~~ 0 direct buffer operations, 2 load operations. ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1000ms ~~ Mem: 183, executed 0 direct buffer operations, 3 load operations. Mem: 183, executed 0 direct buffer operations, 2 load operations. ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1250ms ~~ Mem: 183, executed 0~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1500ms ~~ direct buffer operations, 3 load operations. 13.931: [CMS-concurrent-abortable-preclean: 0.304/1.799 secs] [Times: user=0.31 sys=0.00, real=1.80 secs] 13.932: [GC[YG occupancy: 137843 K (196608 K)]13.932: [Rescan (parallel) , 0.0134340 secs]13.945: [weak refs processing, 0.0000059 secs] [1 CMS-remark: 335668K(1835008K)] 473512K(2031616K), 0.0135139 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 13.946: [CMS-concurrent-sweep-start] Mem: 183, executed 0 direct buffer operations, 2 load operations. 
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1750ms ~~ 13.988: [CMS-concurrent-sweep: 0.043/0.043 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 13.988: [CMS-concurrent-reset-start] 13.997: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] *: eclipse doesn't interleave stdout and stderr perfectly; the line with blocked messages (~~'s) are written to stderr. And back to normal again. Mem: 183, executed 524 direct buffer operations, 2 load operations. Note that the threads without nio calls are continuing as normal, but the buffer operations are stuck during the whole CMS collection. That is bad, because that means SocketChannelImpl.read/write can be stuck in a web server for a long, long time. CPU utilization is spread between several threads, but it doesn't approach 50% on this dual core system, so there's no CPU time starvation. Test case is attached, VM parameters: -ea -server -Xms2G -Xmx2G -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:MaxNewSize=256m -XX:NewSize=256m -XX:+PrintGCDetails -XX:+PrintGC -XX:CMSInitiatingOccupancyFraction=80 -XX:SurvivorRatio=2 -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCTimeStamps; or basically what we run in prod with a smaller heap. If you use different heap sizes, you may have to tune GCStatus's parameters; it's a quick hack, so the cache size isn't really stable; in production, it's meant to LRU out 20%, wait for old gen usage to drop before it can throw out another 20%. What I didn't get is why CMS is active at low heap usage already - it started collecting pretty much continuously right at the start of the test. Side effect from gc locker initiated gc? (also CC-ed directly because I'm not sure the attached file will get through) Chi Ho Kwok -------------- next part -------------- A non-text attachment was scrubbed... Name: GCLockerTest.java Type: application/octet-stream Size: 8244 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100117/498c2019/attachment-0001.obj From chkwok at digibites.nl Mon Jan 18 14:47:41 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Mon, 18 Jan 2010 23:47:41 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> Message-ID: <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com> Hi Doug, On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug wrote: > This is a long shot: in the logs below the problem behaviour appears to start in the abortable-preclean phase. That part of the CMS Collection does some interesting things, but can I believe be disabled by setting CMSMaxAbortablePrecleanTime to 0. > > You might like to try running your test program with the abortable-preclean phase turned off ... Thanks, setting it to a very low value does help a lot. The test case has been changed to increase reporting accuracy: in DirectBufferStresser.run(), save the time before and after every buffer operation, and logs every call taking longer than 100ms. Disabled the other stderr warnings. The log with abortable preclean time set to zero shows: All OK: Mem: 842, executed 450 direct buffer operations, 1 load operations. 
CMS Start:

48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)] 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
48.047: [CMS-concurrent-mark-start]

Workers got stuck:

Mem: 842, executed 0 direct buffer operations, 3 load operations.
48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45 sys=0.00, real=0.46 secs]
48.505: [CMS-concurrent-preclean-start]
48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

Preclean aborted in 0.02s real time:

48.510: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 48.530: [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068 secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]

Sweeping, threads still stuck:

48.540: [CMS-concurrent-sweep-start]
Mem: 842, executed 0 direct buffer operations, 2 load operations.
48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27 sys=0.00, real=0.24 secs]
48.782: [CMS-concurrent-reset-start]

GC is over; CMS-concurrent-reset doesn't block stuff, I guess. Everything is back to normal, but the workers complain about the time spent inside one buffer operation.

Last buffer operation took 1001 ms
Mem: 684, executed 11 direct buffer operations, 3 load operations.
48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
Last buffer operation took 1010 ms
Mem: 684, executed 654 direct buffer operations, 2 load operations.


This is quite a bit better than "Last buffer operation took 5971 ms" running with the old vm arguments, but still, extrapolating to a 16G heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so I'll be applying this workaround to prevent the last few nio calls being stuck for too long. From what I've read, the abortable preclean just does some work in advance for remark so that remark doesn't take too long, while waiting for a desired occupancy in Eden. I'll just cap it at one second unless someone yells *don't*.


Chi Ho Kwok

From Y.S.Ramakrishna at Sun.COM Mon Jan 18 21:07:14 2010
From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna)
Date: Mon, 18 Jan 2010 21:07:14 -0800
Subject: GC_Locker turning CMS into a stop-the-world collector
In-Reply-To: <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com>
References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com>
Message-ID: <4B553E02.1020003@sun.com>

Hi Chi Ho, Doug --

Thanks for the data, the test case and the summary of yr observations. I have not tried the test case yet (US holiday today) but will be looking at this carefully this week and update you. From the behaviour you have described in email and knowing the implementation of GC locker and the dependencies that arise here, I have a good hunch as to what is going on here, but will update only after I have actually confirmed the hunch (or otherwise found the root cause).

Thanks again for calling this in, and I look forward to posting an update soon (including the CR opened to track this issue).

thanks!
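(For reference: the GCLockerTest.java attachment is only available via the archive URL. A minimal, hypothetical sketch of the timed worker loop described above — invented names and sizes, not the actual attachment — could look like this.)

import java.nio.ByteBuffer;

// Hypothetical reconstruction of the kind of worker described above (not the
// actual GCLockerTest.java): repeatedly copy between a heap array and a direct
// buffer, and report any single copy that takes longer than 100 ms.
public class TimedDirectBufferWorker implements Runnable {
    private final ByteBuffer direct = ByteBuffer.allocateDirect(4 * 1024 * 1024);
    private final byte[] scratch = new byte[4 * 1024 * 1024];

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.currentTimeMillis();
            direct.clear();
            direct.put(scratch);   // bulk copy; enters a JNI critical section on 6uXX
            direct.flip();
            direct.get(scratch);
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > 100) {
                System.err.println("Last buffer operation took " + elapsed + " ms");
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2; i++) {
            new Thread(new TimedDirectBufferWorker(), "stresser-" + i).start();
        }
        // A real reproduction, per the thread, also needs an allocation thread to
        // keep the old generation near the CMS trigger, plus -XX:+UseConcMarkSweepGC
        // and -XX:+ExplicitGCInvokesConcurrent on the command line.
    }
}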
- ramki Chi Ho Kwok wrote: > Hi Doug, > > On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug wrote: >> This is a long shot: in the logs below the problem behaviour appears to start in the abortable-preclean phase. That part of the CMS Collection does some interesting things, but can I believe be disabled by setting CMSMaxAbortablePrecleanTime to 0. >> >> You might like to try running your test program with the abortable-preclean phase turned off ... > > Thanks, setting it to a very low value does help a lot. The test case > has been changed to increase reporting accuracy: in > DirectBufferStresser.run(), save the time before and after every > buffer operation, and logs every call taking longer than 100ms. > Disabled the other stderr warnings. The log with abortable preclean > time set to zero shows: > > All OK: > > Mem: 842, executed 450 direct buffer operations, 1 load operations. > > CMS Start: > > 48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)] > 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00, > real=0.01 secs] > 48.047: [CMS-concurrent-mark-start] > > Workers got stuck: > > Mem: 842, executed 0 direct buffer operations, 3 load operations. > 48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45 > sys=0.00, real=0.46 secs] > 48.505: [CMS-concurrent-preclean-start] > 48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 > sys=0.00, real=0.00 secs] > > Preclean aborted in 0.02s real time: > > 48.510: [CMS-concurrent-abortable-preclean-start] > CMS: abort preclean due to time 48.530: > [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times: > user=0.03 sys=0.00, real=0.02 secs] > 48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan > (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068 > secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730 > secs] [Times: user=0.00 sys=0.00, real=0.01 secs] > > Sweeping, threads still stuck: > > 48.540: [CMS-concurrent-sweep-start] > Mem: 842, executed 0 direct buffer operations, 2 load operations. > 48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27 > sys=0.00, real=0.24 secs] > 48.782: [CMS-concurrent-reset-start] > > GC is over, CMS-concurrent-reset doesn't block stuff, I guess; > everything is back to normal, but the workers complain about the time > spent inside one buffer operation. > > Last buffer operation took 1001 ms > Mem: 684, executed 11 direct buffer operations, 3 load operations. > 48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00 > sys=0.00, real=0.01 secs] > Last buffer operation took 1010 ms > Mem: 684, executed 654 direct buffer operations, 2 load operations. > > > This is quite a bit better than "Last buffer operation took 5971 ms" > running with the old vm arguments, but still, extrapolating to a 16G > heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so > I'll be applying this workaround to prevent the last few nio calls > being stuck for too long. From what I've read, the abortable preclean > is just doing some work in advance for remark so that doesn't take too > long and trying to wait for a desired occupancy in Eden. I'll just cap > it at one second unless someone yells *don't*. 
>
>
> Chi Ho Kwok
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From chkwok at digibites.nl Sat Jan 23 16:52:21 2010
From: chkwok at digibites.nl (Chi Ho Kwok)
Date: Sun, 24 Jan 2010 01:52:21 +0100
Subject: GC_Locker turning CMS into a stop-the-world collector
In-Reply-To: <4B553E02.1020003@sun.com>
References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com> <4B553E02.1020003@sun.com>
Message-ID: <1b9d6f691001231652n20f026dej73230a24e0484e16@mail.gmail.com>

Just a little update from our side:

The situation improved a lot since we switched to sockets on the http side, but file loading still blocks threads once in a while, on a MappedByteBuffer.read() call. Tuning CMS to hurry up and finish asap by disabling abortable-preclean and setting threads to 6 helps (hw: 2x4 core prev gen Xeons), but the app still throws a "503 we're too busy, try again later" a few times per hour during the rush hour. That happens when a thread gets stuck waiting on a semaphore that limits the number of threads in the loader (to prevent stuck -> unstuck -> 60 threads fighting for the CPU and nothing gets done) and the timeout of 6 seconds expires.

That's why we've decided to rewrite every single nio file operation into stream based ones; it's now done and deployed on the server for testing, and that should solve our problem for the moment. It's kinda annoying tho, as most of the code expects a ByteBuffer to work on, so instead of just mapping it, we have to allocate the buffer and fill it with FileInputStream.read() in a loop, producing extra garbage.


Chi Ho Kwok


On Tue, Jan 19, 2010 at 6:07 AM, Y. Srinivas Ramakrishna <Y.S.Ramakrishna at sun.com> wrote:

> Hi Chi Ho, Doug --
>
> Thanks for the data, the test case and the summary of yr
> observations. I have not tried the test case yet (US holiday today)
> but will be looking at this carefully this week and update you.
> From the behaviour you have described in email and knowing the
> implementation of GC locker and the dependencies that arise here,
> I have a good hunch as to what is going on here, but will update
> only after I have actually confirmed the hunch (or otherwise
> found the root cause).
>
> Thanks again for calling this in, and I look forward to
> posting an update soon (including the CR opened to track
> this issue).
>
> thanks!
>
> - ramki
>
> Chi Ho Kwok wrote:
>
>> Hi Doug,
>>
>> On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug wrote:
>>
>>> This is a long shot: in the logs below the problem behaviour appears to
>>> start in the abortable-preclean phase. That part of the CMS Collection does
>>> some interesting things, but can I believe be disabled by setting
>>> CMSMaxAbortablePrecleanTime to 0.
>>>
>>> You might like to try running your test program with the
>>> abortable-preclean phase turned off ...
>>>
>>
>> Thanks, setting it to a very low value does help a lot. The test case
>> has been changed to increase reporting accuracy: in
>> DirectBufferStresser.run(), save the time before and after every
>> buffer operation, and logs every call taking longer than 100ms.
>> Disabled the other stderr warnings.
The log with abortable preclean >> time set to zero shows: >> >> All OK: >> >> Mem: 842, executed 450 direct buffer operations, 1 load operations. >> >> CMS Start: >> >> 48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)] >> 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00, >> real=0.01 secs] >> 48.047: [CMS-concurrent-mark-start] >> >> Workers got stuck: >> >> Mem: 842, executed 0 direct buffer operations, 3 load operations. >> 48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45 >> sys=0.00, real=0.46 secs] >> 48.505: [CMS-concurrent-preclean-start] >> 48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 >> sys=0.00, real=0.00 secs] >> >> Preclean aborted in 0.02s real time: >> >> 48.510: [CMS-concurrent-abortable-preclean-start] >> CMS: abort preclean due to time 48.530: >> [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times: >> user=0.03 sys=0.00, real=0.02 secs] >> 48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan >> (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068 >> secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730 >> secs] [Times: user=0.00 sys=0.00, real=0.01 secs] >> >> Sweeping, threads still stuck: >> >> 48.540: [CMS-concurrent-sweep-start] >> Mem: 842, executed 0 direct buffer operations, 2 load operations. >> 48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27 >> sys=0.00, real=0.24 secs] >> 48.782: [CMS-concurrent-reset-start] >> >> GC is over, CMS-concurrent-reset doesn't block stuff, I guess; >> everything is back to normal, but the workers complain about the time >> spent inside one buffer operation. >> >> Last buffer operation took 1001 ms >> Mem: 684, executed 11 direct buffer operations, 3 load operations. >> 48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00 >> sys=0.00, real=0.01 secs] >> Last buffer operation took 1010 ms >> Mem: 684, executed 654 direct buffer operations, 2 load operations. >> >> >> This is quite a bit better than "Last buffer operation took 5971 ms" >> running with the old vm arguments, but still, extrapolating to a 16G >> heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so >> I'll be applying this workaround to prevent the last few nio calls >> being stuck for too long. From what I've read, the abortable preclean >> is just doing some work in advance for remark so that doesn't take too >> long and trying to wait for a desired occupancy in Eden. I'll just cap >> it at one second unless someone yells *don't*. >> >> >> Chi Ho Kwok >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100124/15517c7a/attachment.html From Y.S.Ramakrishna at Sun.COM Mon Jan 25 11:31:10 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Mon, 25 Jan 2010 11:31:10 -0800 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> Message-ID: <4B5DF17E.6080203@sun.com> Just a very quick update on this. 
Chi Ho and I have been communicating off-line on this yesterday and today, and I filed:-

6919638 CMS: NIO-related performance issue, gc locker implementation suspected

and have updated it with what we know so far (the updates appear somewhat delayed at bugs.sun.com).

What is known so far is that the problem is mitigated in jdk 7 because the array copies are broken down into smaller critical sections; as a result the test case does not exhibit the blocking behaviour on jdk7. Not so in jdk 6uXX, where the array copies occur in a single critical section and the blockage is easily reproducible.

What is not yet clear to me (although I am looking into it) is why the blockages seem to coincide with CMS background collections. Investigation ongoing, and the bug report will be kept updated as we find out more.

Chi Ho's latest test case (attached to the bug report) was crucial in quickly reproducing the reported problem (thanks, Chi Ho!).
-- ramki

From Y.S.Ramakrishna at Sun.COM Mon Jan 25 11:58:08 2010
From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna)
Date: Mon, 25 Jan 2010 11:58:08 -0800
Subject: GC_Locker turning CMS into a stop-the-world collector
In-Reply-To: <4B5DF17E.6080203@sun.com>
References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> <4B5DF17E.6080203@sun.com>
Message-ID: <4B5DF7D0.7090601@sun.com>

Thanks to Chi Ho's observation that -XX:+ExplicitGCInvokesConcurrent is essential to reproduce the problem, I now know _exactly_ the root cause of the problem. A fix will be made in the near future. The bug report will be updated.

Thanks again to Chi Ho for the test case and the crucial observation that led to the diagnosis of the problem.

over and out.
-- ramki

Y. Srinivas Ramakrishna wrote:
> Just a very quick update on this. Chi Ho and I have been
> communicating off-line on this yesterday and today,
> and I filed:-
>
> 6919638 CMS: NIO-related performance issue, gc locker implementation suspected
>
> and have updated it with what we know so far (the updates
> appear somewhat delayed at bugs.sun.com).
>
> What is known so far is that the problem is mitigated in
> jdk 7 because the array copies are broken down into
> smaller critical sections; as a result the test case
> does not exhibit the blocking behaviour on jdk7.
> Not so in jdk 6uXX where the array copies occur in
> a single critical section, and the blockage is easily
> reproducible.
>
> What is not yet clear to me (although I am looking
> into it) is why the blockages seem to coincide with CMS
> background collections. Investigation ongoing, and
> the bug report will be kept updated as we find out more.
>
> Chi Ho's latest test case (attached to the bug report) was
> crucial in quickly reproducing the reported problem (thanks, Chi Ho!).
> -- ramki
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From shane.cox at gmail.com Thu Jan 28 09:04:49 2010
From: shane.cox at gmail.com (Shane Cox)
Date: Thu, 28 Jan 2010 12:04:49 -0500
Subject: Minor GCs execute faster after first CMS collection
Message-ID:

Could anyone explain why Minor GCs execute faster after the first CMS collection completes?
I'm running a fairly steady state test and observe 100-200ms Minor GC pauses at the beginning: 2010-01-28T11:45:23.152-0500: 450.374: [GC 450.374: [ParNew: 74560K->10624K(74560K), 0.1361412 secs] 963986K->908709K(1898112K), 0.1362816 secs] [Times: user=0.12 sys=0.12, real=0.14 secs] 2010-01-28T11:45:25.921-0500: 453.144: [GC 453.144: [ParNew: 74560K->10624K(74560K), 0.1757643 secs] 972645K->919770K(1898112K), 0.1759141 secs] [Times: user=0.14 sys=0.15, real=0.18 secs] 2010-01-28T11:45:28.921-0500: 456.143: [GC 456.143: [ParNew: 74560K->10624K(74560K), 0.1516837 secs] 983706K->929407K(1898112K), 0.1518258 secs] [Times: user=0.13 sys=0.13, real=0.15 secs] Then a CMS collection executes: 2010-01-28T11:45:29.074-0500: 456.296: [GC [1 CMS-initial-mark: 918783K(1823552K)] 930044K(1898112K), 0.0084243 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] ...... 2010-01-28T11:45:33.587-0500: 460.809: [CMS-concurrent-reset: 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] After the CMS collection, Minor GC pause times drop to 25-35ms: 2010-01-28T11:45:34.196-0500: 461.418: [GC 461.418: [ParNew: 74560K->10624K(74560K), 0.0349928 secs] 527737K->475222K(1898112K), 0.0351559 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] 2010-01-28T11:45:36.641-0500: 463.863: [GC 463.863: [ParNew: 74560K->10624K(74560K), 0.0300849 secs] 539158K->484716K(1898112K), 0.0302200 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] 2010-01-28T11:45:39.081-0500: 466.304: [GC 466.304: [ParNew: 74560K->10624K(74560K), 0.0300672 secs] 548652K->494809K(1898112K), 0.0302327 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] The only thing that stands out to me is "sys" time which drops to zero after the CMS collection. I'm guessing that an adjustment/optimization is being made after the CMS collection, but don't know what. Any help/ideas would be appreciated. Thanks Full GC log attached. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100128/2e801716/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: 2010-01-28-RouterPTFX021_gc-output.log.gz Type: application/x-gzip Size: 23892 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100128/2e801716/attachment-0001.bin From Jon.Masamitsu at Sun.COM Thu Jan 28 15:09:11 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Thu, 28 Jan 2010 15:09:11 -0800 Subject: Minor GCs execute faster after first CMS collection In-Reply-To: References: Message-ID: <4B621917.8000405@Sun.COM> Shane, Do you ever sees this pattern repeating in longer runs (ParNew pause times increasing until a CMS cycle completes)? There are three CMS cycles in the log that you sent the ParNew pause times look stable after the first CMS cycle. Jon On 01/28/10 09:04, Shane Cox wrote: > Could anyone explain why Minor GCs execute faster after the first CMS > collection completes? 
I'm running a fairly steady state test and > observe 100-200ms Minor GC pauses at the beginning: > 2010-01-28T11:45:23.152-0500: 450.374: [GC 450.374: [ParNew: > 74560K->10624K(74560K), 0.1361412 secs] 963986K->908709K(1898112K), > 0.1362816 secs] [Times: user=0.12 sys=0.12, real=0.14 secs] > 2010-01-28T11:45:25.921-0500: 453.144: [GC 453.144: [ParNew: > 74560K->10624K(74560K), 0.1757643 secs] 972645K->919770K(1898112K), > 0.1759141 secs] [Times: user=0.14 sys=0.15, real=0.18 secs] > 2010-01-28T11:45:28.921-0500: 456.143: [GC 456.143: [ParNew: > 74560K->10624K(74560K), 0.1516837 secs] 983706K->929407K(1898112K), > 0.1518258 secs] [Times: user=0.13 sys=0.13, real=0.15 secs] > > Then a CMS collection executes: > 2010-01-28T11:45:29.074-0500: 456.296: [GC [1 CMS-initial-mark: > 918783K(1823552K)] 930044K(1898112K), 0.0084243 secs] [Times: user=0.01 > sys=0.00, real=0.01 secs] > ...... > 2010-01-28T11:45:33.587-0500: 460.809: [CMS-concurrent-reset: > 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] > > > > After the CMS collection, Minor GC pause times drop to 25-35ms: > 2010-01-28T11:45:34.196-0500: 461.418: [GC 461.418: [ParNew: > 74560K->10624K(74560K), 0.0349928 secs] 527737K->475222K(1898112K), > 0.0351559 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] > 2010-01-28T11:45:36.641-0500: 463.863: [GC 463.863: [ParNew: > 74560K->10624K(74560K), 0.0300849 secs] 539158K->484716K(1898112K), > 0.0302200 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] > 2010-01-28T11:45:39.081-0500: 466.304: [GC 466.304: [ParNew: > 74560K->10624K(74560K), 0.0300672 secs] 548652K->494809K(1898112K), > 0.0302327 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] > > > The only thing that stands out to me is "sys" time which drops to zero > after the CMS collection. > > I'm guessing that an adjustment/optimization is being made after the CMS > collection, but don't know what. Any help/ideas would be appreciated. > > Thanks > > > Full GC log attached. > > > ------------------------------------------------------------------------ > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
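(A small, hypothetical helper — not from the thread — for anyone who wants to check Jon's question against a longer run: it prints "timestamp <tab> ParNew pause" for every minor collection in a -XX:+PrintGCDetails -XX:+PrintGCTimeStamps log like the one quoted above, so a drift in ParNew pauses between CMS cycles is easy to eyeball or plot.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log scraper: for each line containing a ParNew collection, print
// the relative GC timestamp and the ParNew pause in seconds, tab-separated.
public class ParNewPauses {
    private static final Pattern PAR_NEW =
            Pattern.compile("(\\d+\\.\\d+): \\[ParNew: .+?, (\\d+\\.\\d+) secs\\]");

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = PAR_NEW.matcher(line);
            if (m.find()) {
                System.out.println(m.group(1) + "\t" + m.group(2));
            }
        }
        in.close();
    }
}

Running it against the attached log (or any longer run) and looking for pauses that trend upward again before each CMS cycle answers the question Jon raises above.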