64 bit CMS JDK 5.0 u14

Mon Dec 10 19:55:49 UTC 2007

Tony,

I intend to elaborate and add GC logs later, but I want to quickly dispel any misunderstanding. We currently deploy most of our middle-tier apps in a VM running CMS; the one major EBI app that tends not to play well with CMS is our Portal app. However, at 64-bit, the Portal performs as well with CMS as it does in 32-bit with the throughput collector.

I prefer the CMS algorithm since the low pause is crucial. The Portal generates a great many small, short-lived objects, and sometimes in 32-bit mode we overrun the algorithm, even using survivor spaces; fragmentation is usually an issue in our CMS experiences in 32-bit with Portal. I, too, am very happy with CMS at 32-bit; it is just a little hard to fine-tune our Portal with it, but again I am aware of small memory leaks with Portal.

I am very encouraged by the CMS algorithm on AMD 64-bit with Portal; I am just a little concerned about some of the 400-600 second Full GCs when the concurrent mode failures kick in. Fragmentation is at play for sure, so utilizing SS1 and SS2 may help. Is the promotion guarantee still an issue at 5.0 u14?
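
A minimal sketch of what enabling the survivor spaces might look like on top of the 64-bit flags quoted below. The SurvivorRatio value of 8 is an illustrative assumption, not a tested setting; MaxTenuringThreshold=1 is the starting point suggested in the reply quoted below; the two Print flags only add detail to the GC log so that the effect can be checked.

```
-XX:NewSize=320m -XX:MaxNewSize=320m
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:+PrintGCDetails -XX:+PrintTenuringDistribution
```

SurvivorRatio is the ratio of eden to one survivor space, so with a 320 MB young gen a value of 8 would give roughly 256 MB of eden and two 32 MB survivor spaces. Both values would need adjusting against the tenuring distribution seen in the logs.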

I shall add more details later. But I really appreciate all your feedback.

keith

Keith R Holdaway
Java Development Technologies

SAS...  The Power to Know

Carpe Diem ...

-----Original Message-----
From: Tony Printezis [mailto:tony.printezis at sun.com]
Sent: Monday, December 10, 2007 1:59 PM
To: Y.S.Ramakrishna at sun.com
Cc: Keith Holdaway; hotspot-gc-dev at openjdk.java.net
Subject: Re: 64 bit CMS JDK 5.0 u14

Keith and Ramki,

I'd like to add a couple of things to Ramki's informative reply.

First, I wonder why the Portal ran fine with the default GC in 1.4.2
with no failed (I assume timed out, right?) transactions. The default
(aka serial) GC is not a low-pause GC, so maybe the pause times it gave
you were good enough. BTW, what are your latency targets? Hundreds of
ms? Seconds?

Second, we'd be able to help you a bit more constructively if you gave
us some GC logs to look at (and if you have both WLS 8.1 and 9.2 logs,
it'd be nice to get them so that we can compare them).

 > CMS never works very well in the 32-bit environment

FWIW, a lot of our customers are happy with CMS in a 32-bit environment! :-)

Tony

Y.S.Ramakrishna at sun.com wrote:
> Hi Keith --
>
>> I am running some middle-tier Portal load tests in WLS 9.2 MP2 with
>> Sun JDK 5.0 u14. I am running 100 concurrent users that logon,
>> navigate, open portlets, and eventually logoff; only to logon again
>> and repeat the cycle. My testing people have established Load Runner
>> scripts to put the Portal software through an endurance test over 5
>> days with the 100 users.
>>
>> Normally on 32-bit WLS 8.1 SP6 with JDK 1.4.2_13 we run with the
>> following VM args; we attain some 3.4 million passed transactions
>> with zero failed transactions:
>>
>> -server -Xms1400m -Xmx1400m -XX:NewSize=64m -XX:MaxNewSize=64m
>> -XX:PermSize=128m -XX:MaxPermSize=128m -Xss128k -XX:-UseTLAB
>> -XX:+DisableExplicitGC
>> -Dsun.rmi.dgc.client.gcInterval=3600000
>> -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.awt.headless=true
>>
>> CMS never works very well in the 32-bit environment; failing
>> miserably above; although at JDK 1.4.2_15, we see some 2 million
>> passed transactions with 1130+ failed transactions owing to 120
>> seconds timeouts in concurrent mode failures.
>>
>> Now, in the 64 bit environment running on AMD Windows Server 2003, I
>> can run pretty successfully with CMS:
>>
>> -Xms1500m -Xmx3500m -XX:NewSize=320m -XX:MaxNewSize=320m -Xss256k
>> -XX:PermSize=128m -XX:MaxPermSize=128m -XX:+UseConcMarkSweepGC
>> -XX:CMSFullGCsBeforeCompaction=0 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:CMSInitiatingOccupancyFraction=40
>> -Dsun.rmi.dgc.client.gcInterval=3600000
>> -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.awt.headless=true
>> -Dcom.sun.management.jmxremote -verbosegc
>> -Xloggc:C:\keith\GCLogs\gc11.txt
>>
>> I can achieve some 3.4 million passed transactions, as with the
>> throughput collector in the 32-bit env.
>>
>> But, I do note several thousand failed transactions that correlate
>> with concurrent mode failures after some 24 hrs; pauses in the
>> 400-600 seconds range when Full GC takes over.
>>
>
> The fact that you start seeing the concurrent mode failures after
> 24 hours indicates to me strongly that the old generation gets slowly
> fragmented over a period of time. (Recall that the CMS collector is
> non-moving.)
>
> Can you confirm that the heap occupancy itself is constant (or nearly
> so) following CMS collection cycles, and that the full gc that follows
> a concurrent mode failure does not unload classes? Recall that CMS will
> not, by default, unload classes during concurrent cycles unless
> explicitly instructed to do so via:
>
>      -XX:+CMSClassUnloadingEnabled -XX:+PermGenSweepingEnabled
>
> (the second option is needed in pre-6.0 JVMs, but not in more
> recent JVMs).
>
>> I have tried varying the CMSInitiatingOccupancyFraction to 20%, but
>> the CMS mode failures still occur.
>
> It is usually a good idea to use survivor spaces, both to reduce the
> pressure on the concurrent collector (by promoting less to the old
> gen) and to reduce the spread in object sizes and lifetimes
> of the objects that do get promoted. I'd suggest using survivor
> spaces to make sure that survivors stay in the young gen for at least
> one scavenge (MaxTenuringThreshold = 1, possibly more, as experiments
> dictate). A downside is possibly longer scavenges,
> but consider that the price for (possibly) avoiding concurrent
> mode failure.
>
> Avoiding premature promotion of objects can (besides the two points
> made above) also reduce floating garbage and reduce CMS remark pause
> times (by reducing mutation rates in the old generation).
>
>>
>> I am now running with the incremental mode CMS; but anticipate
>> further very long pauses.
>
> From what you described above (running CMS all the time by setting the
> initiation threshold very low), it does not look as though iCMS will
> buy you anything.
>
>>
>> The VM always recovers very well after these sporadic Full GCs, but
>> to eradicate them, should I run with an 8 GB heap or something along
>> those lines? I also read something about killing the swap file?
>>
>> My AMD 64 bit box, unfortunately for now is restricted to 4 GB RAM;
>> but I am adding a further 4 GB soon. I am about to go to the Solaris
>> SPARC 64 bit and run the exact same scenario with a 7-8 GB heap.
>
> Increasing the heap size can indeed sometimes help you avoid
> concurrent mode failure from fragmentation. (But first make
> sure to enable survivor spaces and, if applicable, perm gen
> collection.)
>
>>
>> I read about the occupancy fraction for OG and Perm Gen; do I need to
>> apply this patch? Our Perm Gen is always set to 128 MB and only ever
>> attains 108 MB.
>
> The webrev I posted late last week should not really apply directly to
> your case (except inasmuch as, in the event that you enable perm gen
> collection, it might allow you to get away with not collecting the
> perm gen on each cycle, and thus possibly help keep CMS remark pauses
> shorter). I would not worry about this patch at the level at which you
> are tuning currently (which is mainly looking to avoid the concurrent
> mode failures).
>
> -- ramki
>
>>
>> Any feedback would help us in our endeavours to support our EBI apps
>> in a 64 bit env.
>>
>> keith
>>
>>
>> Keith R Holdaway
>> Java Development Technologies
>>
>> SAS...  The Power to Know
>>
>> Carpe Diem ...
>>
>

--
----------------------------------------------------------------------
| Tony Printezis, Staff Engineer    | Sun Microsystems Inc.          |
|                                   | MS BUR02-311                   |
| e-mail: tony.printezis at sun.com    | 35 Network Drive               |
| office: +1 781 442 0998 (x20998)  | Burlington, MA01803-0902, USA  |
----------------------------------------------------------------------
e-mail client: Thunderbird (Solaris)