From matt.fowles at gmail.com Thu Jun 3 13:17:25 2010 From: matt.fowles at gmail.com (Matt Fowles) Date: Thu, 3 Jun 2010 16:17:25 -0400 Subject: Growing GC Young Gen Times In-Reply-To: <4BEDA55D.5030703@oracle.com> References: <4BEC2776.8010609@oracle.com> <4BEC7498.6030405@oracle.com> <4BEC7D4D.2000905@oracle.com> <4BED8A17.9090208@oracle.com> <4BED8BF9.7000803@oracle.com> <4BEDA55D.5030703@oracle.com> Message-ID: All~ Today we were able to isolate the piece of our system that is causing the increase in GC times. We can now reproduce on multiple machines in a much faster manner (about 20 minutes). The attached log represents a run from just this component. There is a full GC part way through the run that we forced using visual vm. The following interesting facts are presented for your amusement/edification: 1) the young gen pause times increase steadily over the course of the run 2) the full GC doesn't effect the young gen pause times 3) the component in question makes heavy use of JNI I suspect the 3rd fact is the most interesting one. We will [obviously] be looking into reducing the test case further and possibly fixing our JNI code. But this represents a huge leap forward in our understanding of the problem. Matt On Fri, May 14, 2010 at 3:32 PM, Y. Srinivas Ramakrishna < y.s.ramakrishna at oracle.com> wrote: > Matt -- Yes, comparative data for all these for 6u20 and jdk 7 > would be great. Naturally, server 1 is most immediately useful > for determining if 6631166 addresses this at all, > but others would be useful too if it turns out it doesn't > (i.e. if jdk 7's server 1 turns out to be no better than 6u20's -- > at which point we should get this into the right channel -- open a bug, > and a support case). > > thanks. > -- ramki > > > On 05/14/10 12:24, Matt Fowles wrote: > >> Ramki~ >> >> I am preparing the flags for the next 3 runs (which run in parallel) and >> wanted to check a few things with you. I believe that each of these is >> collecting a useful data point, >> >> Server 1 is running with 8 threads, reduced young gen, and MTT 1. >> Server 2 is running with 8 threads, reduced young gen, and MTT 1, ParNew, >> but NOT CMS. >> Server 3 is running with 8 threads, reduced young gen, and MTT 1, and >> PrintFLSStatistics. >> >> I can (additionally) run all of these tests on JDK7 (Java HotSpot(TM) >> 64-Bit Server VM (build 17.0-b05, mixed mode)). 
>> >> Server 1: >> -verbose:gc >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDetails >> -XX:+PrintGCTaskTimeStamps >> -XX:+PrintCommandLineFlags >> >> -Xms32g -Xmx32g -Xmn1g >> -XX:+UseParNewGC >> -XX:ParallelGCThreads=8 >> -XX:+UseConcMarkSweepGC >> -XX:ParallelCMSThreads=8 >> -XX:MaxTenuringThreshold=1 >> -XX:SurvivorRatio=14 >> -XX:+CMSParallelRemarkEnabled >> -Xloggc:gc1.log >> -XX:+UseLargePages -XX:+AlwaysPreTouch >> >> Server 2: >> -verbose:gc >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDetails >> -XX:+PrintGCTaskTimeStamps >> -XX:+PrintCommandLineFlags >> >> -Xms32g -Xmx32g -Xmn1g >> -XX:+UseParNewGC >> -XX:ParallelGCThreads=8 >> -XX:MaxTenuringThreshold=1 >> -XX:SurvivorRatio=14 >> -Xloggc:gc2.log >> -XX:+UseLargePages -XX:+AlwaysPreTouch >> >> >> Server 3: >> -verbose:gc >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDetails >> -XX:+PrintGCTaskTimeStamps >> -XX:+PrintCommandLineFlags >> >> -Xms32g -Xmx32g -Xmn1g >> -XX:+UseParNewGC >> -XX:ParallelGCThreads=8 >> -XX:+UseConcMarkSweepGC >> -XX:ParallelCMSThreads=8 >> -XX:MaxTenuringThreshold=1 >> -XX:SurvivorRatio=14 >> -XX:+CMSParallelRemarkEnabled >> -Xloggc:gc3.log >> >> -XX:PrintFLSStatistics=2 >> -XX:+UseLargePages >> -XX:+AlwaysPreTouch >> Matt >> >> On Fri, May 14, 2010 at 1:44 PM, Y. Srinivas Ramakrishna < >> y.s.ramakrishna at oracle.com > wrote: >> > On 05/14/10 10:36, Y. Srinivas Ramakrishna wrote: >> >> >> >> On 05/14/10 10:24, Matt Fowles wrote: >> >>> >> >>> Jon~ >> >>> >> >>> That makes, sense but the fact is that the old gen *never* get >> >>> collected. So all the allocations happen from the giant empty space >> >>> at the end of the free list. I thought fragmentation only occurred >> >>> when the free lists are added to after freeing memory... >> >> >> >> As Jon indicated allocation is done from free lists of blocks >> >> that are pre-carved on demand to avoid contention while allocating. >> >> The old heuristics for how large to make those lists and the >> >> inventory to hold in those lists was not working well as you >> >> scaled the number of workers. Following 6631166 we believe it >> >> works better and causes both less contention and less >> >> fragmentation than it did before, because we do not hold >> >> unnecessary excess inventory of free blocks. >> > >> > To see what the fragmentation is, try -XX:PrintFLSStatistics=2. >> > This will slow down your scavenge pauses (perhaps by quite a bit >> > for your 26 GB heap), but you will get a report of the number of >> > blocks on free lists and how fragmented the space is on that ccount >> > (for some appropriate notion of fragmentation). Don't use that >> > flag in production though :-) >> > >> > -- ramki >> > >> >> >> >> The fragmentation in turn causes card-scanning to suffer >> >> adversely, besides the issues with loss of spatial locality also >> >> increasing cache misses and TLB misses. (The large page >> >> option might help mitigate the latter a bit, especially >> >> since you have such a large heap and our fragmented >> >> allocation may be exacerbating the TLB pressure.) >> >> >> >> -- ramki >> >> >> >>> Matt >> >>> >> >>> On Thu, May 13, 2010 at 6:29 PM, Jon Masamitsu < >> jon.masamitsu at oracle.com > >> >> >>> wrote: >> >>>> >> >>>> Matt, >> >>>> >> >>>> To amplify on Ramki's comment, the allocations out of the >> >>>> old generation are always from a free list. 
During a young >> >>>> generation collection each GC thread will get its own >> >>>> local free lists from the old generation so that it can >> >>>> copy objects to the old generation without synchronizing >> >>>> with the other GC thread (most of the time). Objects from >> >>>> a GC thread's local free lists are pushed to the globals lists >> >>>> after the collection (as far as I recall). So there is some >> >>>> churn in the free lists. >> >>>> >> >>>> Jon >> >>>> >> >>>> On 05/13/10 14:52, Y. Srinivas Ramakrishna wrote: >> >>>>> >> >>>>> On 05/13/10 10:50, Matt Fowles wrote: >> >>>>>> >> >>>>>> Jon~ >> >>>>>> >> >>>>>> This may sound naive, but how can fragmentation be an issue if the >> old >> >>>>>> gen has never been collected? I would think we are still in the >> space >> >>>>>> where we can just bump the old gen alloc pointer... >> >>>>> >> >>>>> Matt, The old gen allocator may fragment the space. Allocation is >> not >> >>>>> exactly "bump a pointer". >> >>>>> >> >>>>> -- ramki >> >>>>> >> >>>>>> Matt >> >>>>>> >> >>>>>> On Thu, May 13, 2010 at 12:23 PM, Jon Masamitsu >> >>>>>> > >> wrote: >> >>>>>>> >> >>>>>>> Matt, >> >>>>>>> >> >>>>>>> As Ramki indicated fragmentation might be an issue. As the >> >>>>>>> fragmentation >> >>>>>>> in the old generation increases, it takes longer to find space in >> the >> >>>>>>> old >> >>>>>>> generation >> >>>>>>> into which to promote objects from the young generation. This is >> >>>>>>> apparently >> >>>>>>> not >> >>>>>>> the problem that Wayne is having but you still might be hitting >> it. >> >>>>>>> If >> >>>>>>> you >> >>>>>>> can >> >>>>>>> connect jconsole to the VM and force a full GC, that would tell >> us if >> >>>>>>> it's >> >>>>>>> fragmentation. >> >>>>>>> >> >>>>>>> There might be a scaling issue with the UseParNewGC. If you can >> use >> >>>>>>> -XX:-UseParNewGC (turning off the parallel young >> >>>>>>> generation collection) with -XX:+UseConcMarkSweepGC the pauses >> >>>>>>> will be longer but may be more stable. That's not the solution >> but >> >>>>>>> just >> >>>>>>> part >> >>>>>>> of the investigation. >> >>>>>>> >> >>>>>>> You could try just -XX:+UseParNewGC without >> -XX:+UseConcMarkSweepGC >> >>>>>>> and if you don't see the growing young generation pause, that >> would >> >>>>>>> indicate >> >>>>>>> something specific about promotion into the CMS generation. >> >>>>>>> >> >>>>>>> UseParallelGC is different from UseParNewGC in a number of ways >> >>>>>>> and if you try UseParallelGC and still see the growing young >> >>>>>>> generation >> >>>>>>> pauses, I'd suspect something special about your application. >> >>>>>>> >> >>>>>>> If you can run these experiments hopefully they will tell >> >>>>>>> us where to look next. >> >>>>>>> >> >>>>>>> Jon >> >>>>>>> >> >>>>>>> >> >>>>>>> On 05/12/10 15:19, Matt Fowles wrote: >> >>>>>>> >> >>>>>>> All~ >> >>>>>>> >> >>>>>>> I have a large app that produces ~4g of garbage every 30 seconds >> and >> >>>>>>> am trying to reduce the size of gc outliers. About 99% of this >> data >> >>>>>>> is garbage, but almost anything that survives one collection >> survives >> >>>>>>> for an indeterminately long amount of time. 
We are currently >> using >> >>>>>>> the following VM and options: >> >>>>>>> >> >>>>>>> java version "1.6.0_20" >> >>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02) >> >>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode) >> >>>>>>> >> >>>>>>> -verbose:gc >> >>>>>>> -XX:+PrintGCTimeStamps >> >>>>>>> -XX:+PrintGCDetails >> >>>>>>> -XX:+PrintGCTaskTimeStamps >> >>>>>>> -XX:+PrintTenuringDistribution >> >>>>>>> -XX:+PrintCommandLineFlags >> >>>>>>> -XX:+PrintReferenceGC >> >>>>>>> -Xms32g -Xmx32g -Xmn4g >> >>>>>>> -XX:+UseParNewGC >> >>>>>>> -XX:ParallelGCThreads=4 >> >>>>>>> -XX:+UseConcMarkSweepGC >> >>>>>>> -XX:ParallelCMSThreads=4 >> >>>>>>> -XX:CMSInitiatingOccupancyFraction=60 >> >>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >> >>>>>>> -XX:+CMSParallelRemarkEnabled >> >>>>>>> -XX:MaxGCPauseMillis=50 >> >>>>>>> -Xloggc:gc.log >> >>>>>>> >> >>>>>>> >> >>>>>>> As you can see from the GC log, we never actually reach the point >> >>>>>>> where the CMS kicks in (after app startup). But our young gens >> seem >> >>>>>>> to take increasingly long to collect as time goes by. >> >>>>>>> >> >>>>>>> The steady state of the app is reached around 956.392 into the >> log >> >>>>>>> with a collection that takes 0.106 seconds. Thereafter the >> survivor >> >>>>>>> space remains roughly constantly as filled and the amount >> promoted to >> >>>>>>> old gen also remains constant, but the collection times increase >> to >> >>>>>>> 2.855 seconds by the end of the 3.5 hour run. >> >>>>>>> >> >>>>>>> Has anyone seen this sort of behavior before? Are there more >> >>>>>>> switches >> >>>>>>> that I should try running with? >> >>>>>>> >> >>>>>>> Obviously, I am working to profile the app and reduce the garbage >> >>>>>>> load >> >>>>>>> in parallel. But if I still see this sort of problem, it is only >> a >> >>>>>>> question of how long must the app run before I see unacceptable >> >>>>>>> latency spikes. >> >>>>>>> >> >>>>>>> Matt >> >>>>>>> >> >>>>>>> ________________________________ >> >>>>>>> _______________________________________________ >> >>>>>>> hotspot-gc-use mailing list >> >>>>>>> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> >> >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >>>>>> >> >>>>>> _______________________________________________ >> >>>>>> hotspot-gc-use mailing list >> >>>>>> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> >> >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> >> _______________________________________________ >> >> hotspot-gc-use mailing list >> >> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> >> >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> > >> > >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100603/a4e8f2a1/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: gc.log.gz Type: application/x-gzip Size: 14612 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100603/a4e8f2a1/attachment-0001.bin From eric.caspole at amd.com Thu Jun 3 13:57:07 2010 From: eric.caspole at amd.com (Eric Caspole) Date: Thu, 3 Jun 2010 13:57:07 -0700 Subject: Growing GC Young Gen Times In-Reply-To: References: <4BEC2776.8010609@oracle.com> <4BEC7498.6030405@oracle.com> <4BEC7D4D.2000905@oracle.com> <4BED8A17.9090208@oracle.com> <4BED8BF9.7000803@oracle.com> <4BEDA55D.5030703@oracle.com> Message-ID: Hi Matt, I had a problem like this in a previous job. It turned out to be leaking jni handles. The jni handles get scanned during the GC even if they are stale/useless. So I would carefully inspect your JNI code for incorrect use of jni handles. I debugged this problem by doing before/after oprofiles and comparing the runs to see what was taking more time in gc, if that is any help to you. Regards, Eric On Jun 3, 2010, at 1:17 PM, Matt Fowles wrote: > All~ > > Today we were able to isolate the piece of our system that is > causing the increase in GC times. We can now reproduce on multiple > machines in a much faster manner (about 20 minutes). The attached > log represents a run from just this component. There is a full GC > part way through the run that we forced using visual vm. The > following interesting facts are presented for your amusement/ > edification: > > 1) the young gen pause times increase steadily over the course of > the run > 2) the full GC doesn't effect the young gen pause times > 3) the component in question makes heavy use of JNI > > I suspect the 3rd fact is the most interesting one. We will > [obviously] be looking into reducing the test case further and > possibly fixing our JNI code. But this represents a huge leap > forward in our understanding of the problem. > > Matt > > > On Fri, May 14, 2010 at 3:32 PM, Y. Srinivas Ramakrishna > wrote: > Matt -- Yes, comparative data for all these for 6u20 and jdk 7 > would be great. Naturally, server 1 is most immediately useful > for determining if 6631166 addresses this at all, > but others would be useful too if it turns out it doesn't > (i.e. if jdk 7's server 1 turns out to be no better than 6u20's -- > at which point we should get this into the right channel -- open a > bug, > and a support case). > > thanks. > -- ramki > > > On 05/14/10 12:24, Matt Fowles wrote: > Ramki~ > > I am preparing the flags for the next 3 runs (which run in > parallel) and wanted to check a few things with you. I believe > that each of these is collecting a useful data point, > > Server 1 is running with 8 threads, reduced young gen, and MTT 1. > Server 2 is running with 8 threads, reduced young gen, and MTT 1, > ParNew, but NOT CMS. > Server 3 is running with 8 threads, reduced young gen, and MTT 1, > and PrintFLSStatistics. > > I can (additionally) run all of these tests on JDK7 (Java HotSpot > (TM) 64-Bit Server VM (build 17.0-b05, mixed mode)). 
> > Server 1: > -verbose:gc > -XX:+PrintGCTimeStamps > -XX:+PrintGCDetails > -XX:+PrintGCTaskTimeStamps > -XX:+PrintCommandLineFlags > > -Xms32g -Xmx32g -Xmn1g > -XX:+UseParNewGC > -XX:ParallelGCThreads=8 > -XX:+UseConcMarkSweepGC > -XX:ParallelCMSThreads=8 > -XX:MaxTenuringThreshold=1 > -XX:SurvivorRatio=14 > -XX:+CMSParallelRemarkEnabled > -Xloggc:gc1.log > -XX:+UseLargePages -XX:+AlwaysPreTouch > > Server 2: > -verbose:gc > -XX:+PrintGCTimeStamps > -XX:+PrintGCDetails > -XX:+PrintGCTaskTimeStamps > -XX:+PrintCommandLineFlags > > -Xms32g -Xmx32g -Xmn1g > -XX:+UseParNewGC > -XX:ParallelGCThreads=8 > -XX:MaxTenuringThreshold=1 > -XX:SurvivorRatio=14 > -Xloggc:gc2.log > -XX:+UseLargePages -XX:+AlwaysPreTouch > > > Server 3: > -verbose:gc > -XX:+PrintGCTimeStamps > -XX:+PrintGCDetails > -XX:+PrintGCTaskTimeStamps > -XX:+PrintCommandLineFlags > > -Xms32g -Xmx32g -Xmn1g > -XX:+UseParNewGC > -XX:ParallelGCThreads=8 > -XX:+UseConcMarkSweepGC > -XX:ParallelCMSThreads=8 > -XX:MaxTenuringThreshold=1 > -XX:SurvivorRatio=14 > -XX:+CMSParallelRemarkEnabled > -Xloggc:gc3.log > > -XX:PrintFLSStatistics=2 > -XX:+UseLargePages > -XX:+AlwaysPreTouch > Matt > > On Fri, May 14, 2010 at 1:44 PM, Y. Srinivas Ramakrishna > > > wrote: > > On 05/14/10 10:36, Y. Srinivas Ramakrishna wrote: > >> > >> On 05/14/10 10:24, Matt Fowles wrote: > >>> > >>> Jon~ > >>> > >>> That makes, sense but the fact is that the old gen *never* get > >>> collected. So all the allocations happen from the giant empty > space > >>> at the end of the free list. I thought fragmentation only > occurred > >>> when the free lists are added to after freeing memory... > >> > >> As Jon indicated allocation is done from free lists of blocks > >> that are pre-carved on demand to avoid contention while > allocating. > >> The old heuristics for how large to make those lists and the > >> inventory to hold in those lists was not working well as you > >> scaled the number of workers. Following 6631166 we believe it > >> works better and causes both less contention and less > >> fragmentation than it did before, because we do not hold > >> unnecessary excess inventory of free blocks. > > > > To see what the fragmentation is, try -XX:PrintFLSStatistics=2. > > This will slow down your scavenge pauses (perhaps by quite a bit > > for your 26 GB heap), but you will get a report of the number of > > blocks on free lists and how fragmented the space is on that ccount > > (for some appropriate notion of fragmentation). Don't use that > > flag in production though :-) > > > > -- ramki > > > >> > >> The fragmentation in turn causes card-scanning to suffer > >> adversely, besides the issues with loss of spatial locality also > >> increasing cache misses and TLB misses. (The large page > >> option might help mitigate the latter a bit, especially > >> since you have such a large heap and our fragmented > >> allocation may be exacerbating the TLB pressure.) > >> > >> -- ramki > >> > >>> Matt > >>> > >>> On Thu, May 13, 2010 at 6:29 PM, Jon Masamitsu > > > > >>> wrote: > >>>> > >>>> Matt, > >>>> > >>>> To amplify on Ramki's comment, the allocations out of the > >>>> old generation are always from a free list. During a young > >>>> generation collection each GC thread will get its own > >>>> local free lists from the old generation so that it can > >>>> copy objects to the old generation without synchronizing > >>>> with the other GC thread (most of the time). 
Objects from > >>>> a GC thread's local free lists are pushed to the globals lists > >>>> after the collection (as far as I recall). So there is some > >>>> churn in the free lists. > >>>> > >>>> Jon > >>>> > >>>> On 05/13/10 14:52, Y. Srinivas Ramakrishna wrote: > >>>>> > >>>>> On 05/13/10 10:50, Matt Fowles wrote: > >>>>>> > >>>>>> Jon~ > >>>>>> > >>>>>> This may sound naive, but how can fragmentation be an issue > if the old > >>>>>> gen has never been collected? I would think we are still > in the space > >>>>>> where we can just bump the old gen alloc pointer... > >>>>> > >>>>> Matt, The old gen allocator may fragment the space. > Allocation is not > >>>>> exactly "bump a pointer". > >>>>> > >>>>> -- ramki > >>>>> > >>>>>> Matt > >>>>>> > >>>>>> On Thu, May 13, 2010 at 12:23 PM, Jon Masamitsu > >>>>>> > wrote: > >>>>>>> > >>>>>>> Matt, > >>>>>>> > >>>>>>> As Ramki indicated fragmentation might be an issue. As the > >>>>>>> fragmentation > >>>>>>> in the old generation increases, it takes longer to find > space in the > >>>>>>> old > >>>>>>> generation > >>>>>>> into which to promote objects from the young generation. > This is > >>>>>>> apparently > >>>>>>> not > >>>>>>> the problem that Wayne is having but you still might be > hitting it. > >>>>>>> If > >>>>>>> you > >>>>>>> can > >>>>>>> connect jconsole to the VM and force a full GC, that would > tell us if > >>>>>>> it's > >>>>>>> fragmentation. > >>>>>>> > >>>>>>> There might be a scaling issue with the UseParNewGC. If > you can use > >>>>>>> -XX:-UseParNewGC (turning off the parallel young > >>>>>>> generation collection) with -XX:+UseConcMarkSweepGC the > pauses > >>>>>>> will be longer but may be more stable. That's not the > solution but > >>>>>>> just > >>>>>>> part > >>>>>>> of the investigation. > >>>>>>> > >>>>>>> You could try just -XX:+UseParNewGC without -XX: > +UseConcMarkSweepGC > >>>>>>> and if you don't see the growing young generation pause, > that would > >>>>>>> indicate > >>>>>>> something specific about promotion into the CMS generation. > >>>>>>> > >>>>>>> UseParallelGC is different from UseParNewGC in a number of > ways > >>>>>>> and if you try UseParallelGC and still see the growing young > >>>>>>> generation > >>>>>>> pauses, I'd suspect something special about your application. > >>>>>>> > >>>>>>> If you can run these experiments hopefully they will tell > >>>>>>> us where to look next. > >>>>>>> > >>>>>>> Jon > >>>>>>> > >>>>>>> > >>>>>>> On 05/12/10 15:19, Matt Fowles wrote: > >>>>>>> > >>>>>>> All~ > >>>>>>> > >>>>>>> I have a large app that produces ~4g of garbage every 30 > seconds and > >>>>>>> am trying to reduce the size of gc outliers. About 99% of > this data > >>>>>>> is garbage, but almost anything that survives one > collection survives > >>>>>>> for an indeterminately long amount of time. 
We are > currently using > >>>>>>> the following VM and options: > >>>>>>> > >>>>>>> java version "1.6.0_20" > >>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02) > >>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed > mode) > >>>>>>> > >>>>>>> -verbose:gc > >>>>>>> -XX:+PrintGCTimeStamps > >>>>>>> -XX:+PrintGCDetails > >>>>>>> -XX:+PrintGCTaskTimeStamps > >>>>>>> -XX:+PrintTenuringDistribution > >>>>>>> -XX:+PrintCommandLineFlags > >>>>>>> -XX:+PrintReferenceGC > >>>>>>> -Xms32g -Xmx32g -Xmn4g > >>>>>>> -XX:+UseParNewGC > >>>>>>> -XX:ParallelGCThreads=4 > >>>>>>> -XX:+UseConcMarkSweepGC > >>>>>>> -XX:ParallelCMSThreads=4 > >>>>>>> -XX:CMSInitiatingOccupancyFraction=60 > >>>>>>> -XX:+UseCMSInitiatingOccupancyOnly > >>>>>>> -XX:+CMSParallelRemarkEnabled > >>>>>>> -XX:MaxGCPauseMillis=50 > >>>>>>> -Xloggc:gc.log > >>>>>>> > >>>>>>> > >>>>>>> As you can see from the GC log, we never actually reach > the point > >>>>>>> where the CMS kicks in (after app startup). But our young > gens seem > >>>>>>> to take increasingly long to collect as time goes by. > >>>>>>> > >>>>>>> The steady state of the app is reached around 956.392 into > the log > >>>>>>> with a collection that takes 0.106 seconds. Thereafter > the survivor > >>>>>>> space remains roughly constantly as filled and the amount > promoted to > >>>>>>> old gen also remains constant, but the collection times > increase to > >>>>>>> 2.855 seconds by the end of the 3.5 hour run. > >>>>>>> > >>>>>>> Has anyone seen this sort of behavior before? Are there more > >>>>>>> switches > >>>>>>> that I should try running with? > >>>>>>> > >>>>>>> Obviously, I am working to profile the app and reduce the > garbage > >>>>>>> load > >>>>>>> in parallel. But if I still see this sort of problem, it > is only a > >>>>>>> question of how long must the app run before I see > unacceptable > >>>>>>> latency spikes. > >>>>>>> > >>>>>>> Matt > >>>>>>> > >>>>>>> ________________________________ > >>>>>>> _______________________________________________ > >>>>>>> hotspot-gc-use mailing list > >>>>>>> hotspot-gc-use at openjdk.java.net use at openjdk.java.net> > > >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > >>>>>> > >>>>>> _______________________________________________ > >>>>>> hotspot-gc-use mailing list > >>>>>> hotspot-gc-use at openjdk.java.net use at openjdk.java.net> > > >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > >> > >> _______________________________________________ > >> hotspot-gc-use mailing list > >> hotspot-gc-use at openjdk.java.net use at openjdk.java.net> > > >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > > > > > > From matt.fowles at gmail.com Thu Jun 3 14:40:11 2010 From: matt.fowles at gmail.com (Matt Fowles) Date: Thu, 3 Jun 2010 17:40:11 -0400 Subject: Growing GC Young Gen Times In-Reply-To: References: <4BEC2776.8010609@oracle.com> <4BEC7498.6030405@oracle.com> <4BEC7D4D.2000905@oracle.com> <4BED8A17.9090208@oracle.com> <4BED8BF9.7000803@oracle.com> <4BEDA55D.5030703@oracle.com> Message-ID: Eric~ That is my suspicion as well. It would be nice if there were a flag or something to print stats for JNI handles... Matt On Thu, Jun 3, 2010 at 4:57 PM, Eric Caspole wrote: > Hi Matt, > I had a problem like this in a previous job. It turned out to be leaking > jni handles. The jni handles get scanned during the GC even if they are > stale/useless. So I would carefully inspect your JNI code for incorrect use > of jni handles. 
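As a generic illustration of the kind of JNI handle leak Eric describes (a hypothetical example, not Matt's code, and not the specific cause identified later in this thread): a global reference that is created on every call and never deleted remains a GC root, so the set of handles the collector has to scan at every young GC keeps growing.

#include <jni.h>

// Hypothetical native method that leaks one JNI global reference per call.
// Each leaked global ref stays a GC root until DeleteGlobalRef is called,
// so root scanning during every collection has more and more work to do.
extern "C" JNIEXPORT void JNICALL
Java_Example_remember(JNIEnv* env, jclass, jobject value) {
    jobject ref = env->NewGlobalRef(value);  // never released
    (void)ref;
    // Fix: env->DeleteGlobalRef(ref) once the reference is no longer needed.
}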
> > I debugged this problem by doing before/after oprofiles and comparing the > runs to see what was taking more time in gc, if that is any help to you. > Regards, > Eric > > > > On Jun 3, 2010, at 1:17 PM, Matt Fowles wrote: > > All~ >> >> Today we were able to isolate the piece of our system that is causing the >> increase in GC times. We can now reproduce on multiple machines in a much >> faster manner (about 20 minutes). The attached log represents a run from >> just this component. There is a full GC part way through the run that we >> forced using visual vm. The following interesting facts are presented for >> your amusement/edification: >> >> 1) the young gen pause times increase steadily over the course of the run >> 2) the full GC doesn't effect the young gen pause times >> 3) the component in question makes heavy use of JNI >> >> I suspect the 3rd fact is the most interesting one. We will [obviously] >> be looking into reducing the test case further and possibly fixing our JNI >> code. But this represents a huge leap forward in our understanding of the >> problem. >> >> Matt >> >> >> On Fri, May 14, 2010 at 3:32 PM, Y. Srinivas Ramakrishna < >> y.s.ramakrishna at oracle.com> wrote: >> Matt -- Yes, comparative data for all these for 6u20 and jdk 7 >> would be great. Naturally, server 1 is most immediately useful >> for determining if 6631166 addresses this at all, >> but others would be useful too if it turns out it doesn't >> (i.e. if jdk 7's server 1 turns out to be no better than 6u20's -- >> at which point we should get this into the right channel -- open a bug, >> and a support case). >> >> thanks. >> -- ramki >> >> >> On 05/14/10 12:24, Matt Fowles wrote: >> Ramki~ >> >> I am preparing the flags for the next 3 runs (which run in parallel) and >> wanted to check a few things with you. I believe that each of these is >> collecting a useful data point, >> >> Server 1 is running with 8 threads, reduced young gen, and MTT 1. >> Server 2 is running with 8 threads, reduced young gen, and MTT 1, ParNew, >> but NOT CMS. >> Server 3 is running with 8 threads, reduced young gen, and MTT 1, and >> PrintFLSStatistics. >> >> I can (additionally) run all of these tests on JDK7 (Java HotSpot(TM) >> 64-Bit Server VM (build 17.0-b05, mixed mode)). >> >> Server 1: >> -verbose:gc >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDetails >> -XX:+PrintGCTaskTimeStamps >> -XX:+PrintCommandLineFlags >> >> -Xms32g -Xmx32g -Xmn1g >> -XX:+UseParNewGC >> -XX:ParallelGCThreads=8 >> -XX:+UseConcMarkSweepGC >> -XX:ParallelCMSThreads=8 >> -XX:MaxTenuringThreshold=1 >> -XX:SurvivorRatio=14 >> -XX:+CMSParallelRemarkEnabled >> -Xloggc:gc1.log >> -XX:+UseLargePages -XX:+AlwaysPreTouch >> >> Server 2: >> -verbose:gc >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDetails >> -XX:+PrintGCTaskTimeStamps >> -XX:+PrintCommandLineFlags >> >> -Xms32g -Xmx32g -Xmn1g >> -XX:+UseParNewGC >> -XX:ParallelGCThreads=8 >> -XX:MaxTenuringThreshold=1 >> -XX:SurvivorRatio=14 >> -Xloggc:gc2.log >> -XX:+UseLargePages -XX:+AlwaysPreTouch >> >> >> Server 3: >> -verbose:gc >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDetails >> -XX:+PrintGCTaskTimeStamps >> -XX:+PrintCommandLineFlags >> >> -Xms32g -Xmx32g -Xmn1g >> -XX:+UseParNewGC >> -XX:ParallelGCThreads=8 >> -XX:+UseConcMarkSweepGC >> -XX:ParallelCMSThreads=8 >> -XX:MaxTenuringThreshold=1 >> -XX:SurvivorRatio=14 >> -XX:+CMSParallelRemarkEnabled >> -Xloggc:gc3.log >> >> -XX:PrintFLSStatistics=2 >> -XX:+UseLargePages >> -XX:+AlwaysPreTouch >> Matt >> >> On Fri, May 14, 2010 at 1:44 PM, Y. 
Srinivas Ramakrishna < >> y.s.ramakrishna at oracle.com > wrote: >> > On 05/14/10 10:36, Y. Srinivas Ramakrishna wrote: >> >> >> >> On 05/14/10 10:24, Matt Fowles wrote: >> >>> >> >>> Jon~ >> >>> >> >>> That makes, sense but the fact is that the old gen *never* get >> >>> collected. So all the allocations happen from the giant empty space >> >>> at the end of the free list. I thought fragmentation only occurred >> >>> when the free lists are added to after freeing memory... >> >> >> >> As Jon indicated allocation is done from free lists of blocks >> >> that are pre-carved on demand to avoid contention while allocating. >> >> The old heuristics for how large to make those lists and the >> >> inventory to hold in those lists was not working well as you >> >> scaled the number of workers. Following 6631166 we believe it >> >> works better and causes both less contention and less >> >> fragmentation than it did before, because we do not hold >> >> unnecessary excess inventory of free blocks. >> > >> > To see what the fragmentation is, try -XX:PrintFLSStatistics=2. >> > This will slow down your scavenge pauses (perhaps by quite a bit >> > for your 26 GB heap), but you will get a report of the number of >> > blocks on free lists and how fragmented the space is on that ccount >> > (for some appropriate notion of fragmentation). Don't use that >> > flag in production though :-) >> > >> > -- ramki >> > >> >> >> >> The fragmentation in turn causes card-scanning to suffer >> >> adversely, besides the issues with loss of spatial locality also >> >> increasing cache misses and TLB misses. (The large page >> >> option might help mitigate the latter a bit, especially >> >> since you have such a large heap and our fragmented >> >> allocation may be exacerbating the TLB pressure.) >> >> >> >> -- ramki >> >> >> >>> Matt >> >>> >> >>> On Thu, May 13, 2010 at 6:29 PM, Jon Masamitsu < >> jon.masamitsu at oracle.com > >> >> >>> wrote: >> >>>> >> >>>> Matt, >> >>>> >> >>>> To amplify on Ramki's comment, the allocations out of the >> >>>> old generation are always from a free list. During a young >> >>>> generation collection each GC thread will get its own >> >>>> local free lists from the old generation so that it can >> >>>> copy objects to the old generation without synchronizing >> >>>> with the other GC thread (most of the time). Objects from >> >>>> a GC thread's local free lists are pushed to the globals lists >> >>>> after the collection (as far as I recall). So there is some >> >>>> churn in the free lists. >> >>>> >> >>>> Jon >> >>>> >> >>>> On 05/13/10 14:52, Y. Srinivas Ramakrishna wrote: >> >>>>> >> >>>>> On 05/13/10 10:50, Matt Fowles wrote: >> >>>>>> >> >>>>>> Jon~ >> >>>>>> >> >>>>>> This may sound naive, but how can fragmentation be an issue if the >> old >> >>>>>> gen has never been collected? I would think we are still in the >> space >> >>>>>> where we can just bump the old gen alloc pointer... >> >>>>> >> >>>>> Matt, The old gen allocator may fragment the space. Allocation is >> not >> >>>>> exactly "bump a pointer". >> >>>>> >> >>>>> -- ramki >> >>>>> >> >>>>>> Matt >> >>>>>> >> >>>>>> On Thu, May 13, 2010 at 12:23 PM, Jon Masamitsu >> >>>>>> > >> wrote: >> >>>>>>> >> >>>>>>> Matt, >> >>>>>>> >> >>>>>>> As Ramki indicated fragmentation might be an issue. As the >> >>>>>>> fragmentation >> >>>>>>> in the old generation increases, it takes longer to find space in >> the >> >>>>>>> old >> >>>>>>> generation >> >>>>>>> into which to promote objects from the young generation. 
This is >> >>>>>>> apparently >> >>>>>>> not >> >>>>>>> the problem that Wayne is having but you still might be hitting >> it. >> >>>>>>> If >> >>>>>>> you >> >>>>>>> can >> >>>>>>> connect jconsole to the VM and force a full GC, that would tell >> us if >> >>>>>>> it's >> >>>>>>> fragmentation. >> >>>>>>> >> >>>>>>> There might be a scaling issue with the UseParNewGC. If you can >> use >> >>>>>>> -XX:-UseParNewGC (turning off the parallel young >> >>>>>>> generation collection) with -XX:+UseConcMarkSweepGC the pauses >> >>>>>>> will be longer but may be more stable. That's not the solution >> but >> >>>>>>> just >> >>>>>>> part >> >>>>>>> of the investigation. >> >>>>>>> >> >>>>>>> You could try just -XX:+UseParNewGC without >> -XX:+UseConcMarkSweepGC >> >>>>>>> and if you don't see the growing young generation pause, that >> would >> >>>>>>> indicate >> >>>>>>> something specific about promotion into the CMS generation. >> >>>>>>> >> >>>>>>> UseParallelGC is different from UseParNewGC in a number of ways >> >>>>>>> and if you try UseParallelGC and still see the growing young >> >>>>>>> generation >> >>>>>>> pauses, I'd suspect something special about your application. >> >>>>>>> >> >>>>>>> If you can run these experiments hopefully they will tell >> >>>>>>> us where to look next. >> >>>>>>> >> >>>>>>> Jon >> >>>>>>> >> >>>>>>> >> >>>>>>> On 05/12/10 15:19, Matt Fowles wrote: >> >>>>>>> >> >>>>>>> All~ >> >>>>>>> >> >>>>>>> I have a large app that produces ~4g of garbage every 30 seconds >> and >> >>>>>>> am trying to reduce the size of gc outliers. About 99% of this >> data >> >>>>>>> is garbage, but almost anything that survives one collection >> survives >> >>>>>>> for an indeterminately long amount of time. We are currently >> using >> >>>>>>> the following VM and options: >> >>>>>>> >> >>>>>>> java version "1.6.0_20" >> >>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02) >> >>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode) >> >>>>>>> >> >>>>>>> -verbose:gc >> >>>>>>> -XX:+PrintGCTimeStamps >> >>>>>>> -XX:+PrintGCDetails >> >>>>>>> -XX:+PrintGCTaskTimeStamps >> >>>>>>> -XX:+PrintTenuringDistribution >> >>>>>>> -XX:+PrintCommandLineFlags >> >>>>>>> -XX:+PrintReferenceGC >> >>>>>>> -Xms32g -Xmx32g -Xmn4g >> >>>>>>> -XX:+UseParNewGC >> >>>>>>> -XX:ParallelGCThreads=4 >> >>>>>>> -XX:+UseConcMarkSweepGC >> >>>>>>> -XX:ParallelCMSThreads=4 >> >>>>>>> -XX:CMSInitiatingOccupancyFraction=60 >> >>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >> >>>>>>> -XX:+CMSParallelRemarkEnabled >> >>>>>>> -XX:MaxGCPauseMillis=50 >> >>>>>>> -Xloggc:gc.log >> >>>>>>> >> >>>>>>> >> >>>>>>> As you can see from the GC log, we never actually reach the point >> >>>>>>> where the CMS kicks in (after app startup). But our young gens >> seem >> >>>>>>> to take increasingly long to collect as time goes by. >> >>>>>>> >> >>>>>>> The steady state of the app is reached around 956.392 into the >> log >> >>>>>>> with a collection that takes 0.106 seconds. Thereafter the >> survivor >> >>>>>>> space remains roughly constantly as filled and the amount >> promoted to >> >>>>>>> old gen also remains constant, but the collection times increase >> to >> >>>>>>> 2.855 seconds by the end of the 3.5 hour run. >> >>>>>>> >> >>>>>>> Has anyone seen this sort of behavior before? Are there more >> >>>>>>> switches >> >>>>>>> that I should try running with? >> >>>>>>> >> >>>>>>> Obviously, I am working to profile the app and reduce the garbage >> >>>>>>> load >> >>>>>>> in parallel. 
But if I still see this sort of problem, it is only >> a >> >>>>>>> question of how long must the app run before I see unacceptable >> >>>>>>> latency spikes. >> >>>>>>> >> >>>>>>> Matt >> >>>>>>> >> >>>>>>> ________________________________ >> >>>>>>> _______________________________________________ >> >>>>>>> hotspot-gc-use mailing list >> >>>>>>> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> >> >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >>>>>> >> >>>>>> _______________________________________________ >> >>>>>> hotspot-gc-use mailing list >> >>>>>> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> >> >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> >> _______________________________________________ >> >> hotspot-gc-use mailing list >> >> hotspot-gc-use at openjdk.java.net > hotspot-gc-use at openjdk.java.net> >> >> >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> > >> > >> >> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100603/2cdc3a9f/attachment-0001.html From matt.fowles at gmail.com Fri Jun 4 14:11:04 2010 From: matt.fowles at gmail.com (Matt Fowles) Date: Fri, 4 Jun 2010 17:11:04 -0400 Subject: Growing GC Young Gen Times In-Reply-To: References: <4BEC2776.8010609@oracle.com> <4BEC7498.6030405@oracle.com> <4BEC7D4D.2000905@oracle.com> <4BED8A17.9090208@oracle.com> <4BED8BF9.7000803@oracle.com> <4BEDA55D.5030703@oracle.com> Message-ID: All~ Attached is a very small reproduction script. From the included RE 1) set JAVA_HOME 2) build with `make` 3) run with java -verbose:gc -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+PrintGCTaskTimeStamps -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Xms1g -Xmx1g -Xmn10m -Djava.library.path=lib -jar tester.jar 4) notice the growing GC times >From having experimented with this, the key is the jvm->AttachCurrentThread()/jvm->DetachCurrentThread() calls when combined with the env->GetObjectClass() and env->GetMethodID() calls. If you attach and detatch within the loop, the GC times do not grow without bound. If you cache the results of GetMethodID outside the loop, the results do not grow without bound. Looking at src/share/vm/runtime/handles.cpp:76, you see the comment // The thread local handle areas should not get very large However, src/share/vm/prims/jni.cpp:1229 allocates a KlassHandle in each call to GetMethodID, which seems to leak as it is never freed from the HandleArea of the thread... Obviously, I can fix my code to cache its jmethodID's and/or release threads more often (which I have done and it seems to fix my problem). But this seems a surprising side effect of GetMethodID()... Matt On Thu, Jun 3, 2010 at 5:40 PM, Matt Fowles wrote: > Eric~ > > That is my suspicion as well. It would be nice if there were a flag or > something to print stats for JNI handles... > > Matt > > > On Thu, Jun 3, 2010 at 4:57 PM, Eric Caspole wrote: > >> Hi Matt, >> I had a problem like this in a previous job. It turned out to be leaking >> jni handles. The jni handles get scanned during the GC even if they are >> stale/useless. So I would carefully inspect your JNI code for incorrect use >> of jni handles. >> >> I debugged this problem by doing before/after oprofiles and comparing the >> runs to see what was taking more time in gc, if that is any help to you. 
>> Regards, >> Eric >> >> >> >> On Jun 3, 2010, at 1:17 PM, Matt Fowles wrote: >> >> All~ >>> >>> Today we were able to isolate the piece of our system that is causing the >>> increase in GC times. We can now reproduce on multiple machines in a much >>> faster manner (about 20 minutes). The attached log represents a run from >>> just this component. There is a full GC part way through the run that we >>> forced using visual vm. The following interesting facts are presented for >>> your amusement/edification: >>> >>> 1) the young gen pause times increase steadily over the course of the run >>> 2) the full GC doesn't effect the young gen pause times >>> 3) the component in question makes heavy use of JNI >>> >>> I suspect the 3rd fact is the most interesting one. We will [obviously] >>> be looking into reducing the test case further and possibly fixing our JNI >>> code. But this represents a huge leap forward in our understanding of the >>> problem. >>> >>> Matt >>> >>> >>> On Fri, May 14, 2010 at 3:32 PM, Y. Srinivas Ramakrishna < >>> y.s.ramakrishna at oracle.com> wrote: >>> Matt -- Yes, comparative data for all these for 6u20 and jdk 7 >>> would be great. Naturally, server 1 is most immediately useful >>> for determining if 6631166 addresses this at all, >>> but others would be useful too if it turns out it doesn't >>> (i.e. if jdk 7's server 1 turns out to be no better than 6u20's -- >>> at which point we should get this into the right channel -- open a bug, >>> and a support case). >>> >>> thanks. >>> -- ramki >>> >>> >>> On 05/14/10 12:24, Matt Fowles wrote: >>> Ramki~ >>> >>> I am preparing the flags for the next 3 runs (which run in parallel) and >>> wanted to check a few things with you. I believe that each of these is >>> collecting a useful data point, >>> >>> Server 1 is running with 8 threads, reduced young gen, and MTT 1. >>> Server 2 is running with 8 threads, reduced young gen, and MTT 1, ParNew, >>> but NOT CMS. >>> Server 3 is running with 8 threads, reduced young gen, and MTT 1, and >>> PrintFLSStatistics. >>> >>> I can (additionally) run all of these tests on JDK7 (Java HotSpot(TM) >>> 64-Bit Server VM (build 17.0-b05, mixed mode)). >>> >>> Server 1: >>> -verbose:gc >>> -XX:+PrintGCTimeStamps >>> -XX:+PrintGCDetails >>> -XX:+PrintGCTaskTimeStamps >>> -XX:+PrintCommandLineFlags >>> >>> -Xms32g -Xmx32g -Xmn1g >>> -XX:+UseParNewGC >>> -XX:ParallelGCThreads=8 >>> -XX:+UseConcMarkSweepGC >>> -XX:ParallelCMSThreads=8 >>> -XX:MaxTenuringThreshold=1 >>> -XX:SurvivorRatio=14 >>> -XX:+CMSParallelRemarkEnabled >>> -Xloggc:gc1.log >>> -XX:+UseLargePages -XX:+AlwaysPreTouch >>> >>> Server 2: >>> -verbose:gc >>> -XX:+PrintGCTimeStamps >>> -XX:+PrintGCDetails >>> -XX:+PrintGCTaskTimeStamps >>> -XX:+PrintCommandLineFlags >>> >>> -Xms32g -Xmx32g -Xmn1g >>> -XX:+UseParNewGC >>> -XX:ParallelGCThreads=8 >>> -XX:MaxTenuringThreshold=1 >>> -XX:SurvivorRatio=14 >>> -Xloggc:gc2.log >>> -XX:+UseLargePages -XX:+AlwaysPreTouch >>> >>> >>> Server 3: >>> -verbose:gc >>> -XX:+PrintGCTimeStamps >>> -XX:+PrintGCDetails >>> -XX:+PrintGCTaskTimeStamps >>> -XX:+PrintCommandLineFlags >>> >>> -Xms32g -Xmx32g -Xmn1g >>> -XX:+UseParNewGC >>> -XX:ParallelGCThreads=8 >>> -XX:+UseConcMarkSweepGC >>> -XX:ParallelCMSThreads=8 >>> -XX:MaxTenuringThreshold=1 >>> -XX:SurvivorRatio=14 >>> -XX:+CMSParallelRemarkEnabled >>> -Xloggc:gc3.log >>> >>> -XX:PrintFLSStatistics=2 >>> -XX:+UseLargePages >>> -XX:+AlwaysPreTouch >>> Matt >>> >>> On Fri, May 14, 2010 at 1:44 PM, Y. 
Srinivas Ramakrishna < >>> y.s.ramakrishna at oracle.com > wrote: >>> > On 05/14/10 10:36, Y. Srinivas Ramakrishna wrote: >>> >> >>> >> On 05/14/10 10:24, Matt Fowles wrote: >>> >>> >>> >>> Jon~ >>> >>> >>> >>> That makes, sense but the fact is that the old gen *never* get >>> >>> collected. So all the allocations happen from the giant empty space >>> >>> at the end of the free list. I thought fragmentation only occurred >>> >>> when the free lists are added to after freeing memory... >>> >> >>> >> As Jon indicated allocation is done from free lists of blocks >>> >> that are pre-carved on demand to avoid contention while allocating. >>> >> The old heuristics for how large to make those lists and the >>> >> inventory to hold in those lists was not working well as you >>> >> scaled the number of workers. Following 6631166 we believe it >>> >> works better and causes both less contention and less >>> >> fragmentation than it did before, because we do not hold >>> >> unnecessary excess inventory of free blocks. >>> > >>> > To see what the fragmentation is, try -XX:PrintFLSStatistics=2. >>> > This will slow down your scavenge pauses (perhaps by quite a bit >>> > for your 26 GB heap), but you will get a report of the number of >>> > blocks on free lists and how fragmented the space is on that ccount >>> > (for some appropriate notion of fragmentation). Don't use that >>> > flag in production though :-) >>> > >>> > -- ramki >>> > >>> >> >>> >> The fragmentation in turn causes card-scanning to suffer >>> >> adversely, besides the issues with loss of spatial locality also >>> >> increasing cache misses and TLB misses. (The large page >>> >> option might help mitigate the latter a bit, especially >>> >> since you have such a large heap and our fragmented >>> >> allocation may be exacerbating the TLB pressure.) >>> >> >>> >> -- ramki >>> >> >>> >>> Matt >>> >>> >>> >>> On Thu, May 13, 2010 at 6:29 PM, Jon Masamitsu < >>> jon.masamitsu at oracle.com > >>> >>> >>> wrote: >>> >>>> >>> >>>> Matt, >>> >>>> >>> >>>> To amplify on Ramki's comment, the allocations out of the >>> >>>> old generation are always from a free list. During a young >>> >>>> generation collection each GC thread will get its own >>> >>>> local free lists from the old generation so that it can >>> >>>> copy objects to the old generation without synchronizing >>> >>>> with the other GC thread (most of the time). Objects from >>> >>>> a GC thread's local free lists are pushed to the globals lists >>> >>>> after the collection (as far as I recall). So there is some >>> >>>> churn in the free lists. >>> >>>> >>> >>>> Jon >>> >>>> >>> >>>> On 05/13/10 14:52, Y. Srinivas Ramakrishna wrote: >>> >>>>> >>> >>>>> On 05/13/10 10:50, Matt Fowles wrote: >>> >>>>>> >>> >>>>>> Jon~ >>> >>>>>> >>> >>>>>> This may sound naive, but how can fragmentation be an issue if >>> the old >>> >>>>>> gen has never been collected? I would think we are still in the >>> space >>> >>>>>> where we can just bump the old gen alloc pointer... >>> >>>>> >>> >>>>> Matt, The old gen allocator may fragment the space. Allocation is >>> not >>> >>>>> exactly "bump a pointer". >>> >>>>> >>> >>>>> -- ramki >>> >>>>> >>> >>>>>> Matt >>> >>>>>> >>> >>>>>> On Thu, May 13, 2010 at 12:23 PM, Jon Masamitsu >>> >>>>>> > >>> wrote: >>> >>>>>>> >>> >>>>>>> Matt, >>> >>>>>>> >>> >>>>>>> As Ramki indicated fragmentation might be an issue. 
As the >>> >>>>>>> fragmentation >>> >>>>>>> in the old generation increases, it takes longer to find space >>> in the >>> >>>>>>> old >>> >>>>>>> generation >>> >>>>>>> into which to promote objects from the young generation. This >>> is >>> >>>>>>> apparently >>> >>>>>>> not >>> >>>>>>> the problem that Wayne is having but you still might be hitting >>> it. >>> >>>>>>> If >>> >>>>>>> you >>> >>>>>>> can >>> >>>>>>> connect jconsole to the VM and force a full GC, that would tell >>> us if >>> >>>>>>> it's >>> >>>>>>> fragmentation. >>> >>>>>>> >>> >>>>>>> There might be a scaling issue with the UseParNewGC. If you can >>> use >>> >>>>>>> -XX:-UseParNewGC (turning off the parallel young >>> >>>>>>> generation collection) with -XX:+UseConcMarkSweepGC the pauses >>> >>>>>>> will be longer but may be more stable. That's not the solution >>> but >>> >>>>>>> just >>> >>>>>>> part >>> >>>>>>> of the investigation. >>> >>>>>>> >>> >>>>>>> You could try just -XX:+UseParNewGC without >>> -XX:+UseConcMarkSweepGC >>> >>>>>>> and if you don't see the growing young generation pause, that >>> would >>> >>>>>>> indicate >>> >>>>>>> something specific about promotion into the CMS generation. >>> >>>>>>> >>> >>>>>>> UseParallelGC is different from UseParNewGC in a number of ways >>> >>>>>>> and if you try UseParallelGC and still see the growing young >>> >>>>>>> generation >>> >>>>>>> pauses, I'd suspect something special about your application. >>> >>>>>>> >>> >>>>>>> If you can run these experiments hopefully they will tell >>> >>>>>>> us where to look next. >>> >>>>>>> >>> >>>>>>> Jon >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> On 05/12/10 15:19, Matt Fowles wrote: >>> >>>>>>> >>> >>>>>>> All~ >>> >>>>>>> >>> >>>>>>> I have a large app that produces ~4g of garbage every 30 seconds >>> and >>> >>>>>>> am trying to reduce the size of gc outliers. About 99% of this >>> data >>> >>>>>>> is garbage, but almost anything that survives one collection >>> survives >>> >>>>>>> for an indeterminately long amount of time. We are currently >>> using >>> >>>>>>> the following VM and options: >>> >>>>>>> >>> >>>>>>> java version "1.6.0_20" >>> >>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02) >>> >>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode) >>> >>>>>>> >>> >>>>>>> -verbose:gc >>> >>>>>>> -XX:+PrintGCTimeStamps >>> >>>>>>> -XX:+PrintGCDetails >>> >>>>>>> -XX:+PrintGCTaskTimeStamps >>> >>>>>>> -XX:+PrintTenuringDistribution >>> >>>>>>> -XX:+PrintCommandLineFlags >>> >>>>>>> -XX:+PrintReferenceGC >>> >>>>>>> -Xms32g -Xmx32g -Xmn4g >>> >>>>>>> -XX:+UseParNewGC >>> >>>>>>> -XX:ParallelGCThreads=4 >>> >>>>>>> -XX:+UseConcMarkSweepGC >>> >>>>>>> -XX:ParallelCMSThreads=4 >>> >>>>>>> -XX:CMSInitiatingOccupancyFraction=60 >>> >>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>> >>>>>>> -XX:+CMSParallelRemarkEnabled >>> >>>>>>> -XX:MaxGCPauseMillis=50 >>> >>>>>>> -Xloggc:gc.log >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> As you can see from the GC log, we never actually reach the >>> point >>> >>>>>>> where the CMS kicks in (after app startup). But our young gens >>> seem >>> >>>>>>> to take increasingly long to collect as time goes by. >>> >>>>>>> >>> >>>>>>> The steady state of the app is reached around 956.392 into the >>> log >>> >>>>>>> with a collection that takes 0.106 seconds. 
Thereafter the >>> survivor >>> >>>>>>> space remains roughly constantly as filled and the amount >>> promoted to >>> >>>>>>> old gen also remains constant, but the collection times increase >>> to >>> >>>>>>> 2.855 seconds by the end of the 3.5 hour run. >>> >>>>>>> >>> >>>>>>> Has anyone seen this sort of behavior before? Are there more >>> >>>>>>> switches >>> >>>>>>> that I should try running with? >>> >>>>>>> >>> >>>>>>> Obviously, I am working to profile the app and reduce the >>> garbage >>> >>>>>>> load >>> >>>>>>> in parallel. But if I still see this sort of problem, it is >>> only a >>> >>>>>>> question of how long must the app run before I see unacceptable >>> >>>>>>> latency spikes. >>> >>>>>>> >>> >>>>>>> Matt >>> >>>>>>> >>> >>>>>>> ________________________________ >>> >>>>>>> _______________________________________________ >>> >>>>>>> hotspot-gc-use mailing list >>> >>>>>>> hotspot-gc-use at openjdk.java.net >> hotspot-gc-use at openjdk.java.net> >>> >>> >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>>>>> >>> >>>>>> _______________________________________________ >>> >>>>>> hotspot-gc-use mailing list >>> >>>>>> hotspot-gc-use at openjdk.java.net >> hotspot-gc-use at openjdk.java.net> >>> >>> >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >> >>> >> _______________________________________________ >>> >> hotspot-gc-use mailing list >>> >> hotspot-gc-use at openjdk.java.net >> hotspot-gc-use at openjdk.java.net> >>> >>> >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> > >>> > >>> >>> >>> >>> >>> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100604/62724d5b/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: GC_Growth.tgz Type: application/x-gzip Size: 37339 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100604/62724d5b/attachment-0001.bin From jon.masamitsu at oracle.com Tue Jun 8 12:22:15 2010 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Tue, 08 Jun 2010 12:22:15 -0700 Subject: Growing GC Young Gen Times In-Reply-To: References: <4BEC2776.8010609@oracle.com> <4BEC7498.6030405@oracle.com> <4BEC7D4D.2000905@oracle.com> <4BED8A17.9090208@oracle.com> <4BED8BF9.7000803@oracle.com> <4BEDA55D.5030703@oracle.com> Message-ID: <4C0E9867.5090900@oracle.com> Matt, I've file a bug for this. 6959511: KlassHandle leak in runtime Thanks. Jon On 06/04/10 14:11, Matt Fowles wrote: > All~ > > Attached is a very small reproduction script. From the included RE > > 1) set JAVA_HOME > 2) build with `make` > 3) run with > > java -verbose:gc -XX:+PrintCommandLineFlags -XX:+PrintGCDetails > -XX:+PrintGCTaskTimeStamps -XX:+PrintGCTimeStamps > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Xms1g -Xmx1g -Xmn10m > -Djava.library.path=lib -jar tester.jar > > 4) notice the growing GC times > > > From having experimented with this, the key is the > jvm->AttachCurrentThread()/jvm->DetachCurrentThread() calls when > combined with the env->GetObjectClass() and env->GetMethodID() calls. > > If you attach and detatch within the loop, the GC times do not grow > without bound. If you cache the results of GetMethodID outside the > loop, the results do not grow without bound. 
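The two workarounds described above map onto that pattern roughly as follows (again a sketch with hypothetical names, not the code from the attached test case): cache the jmethodID outside the loop, and/or detach the thread periodically so its handle area is released.

#include <jni.h>

// Variant with the workarounds applied: the jmethodID is looked up once
// (jmethodIDs stay valid for the lifetime of the class), and the thread can
// additionally detach and re-attach periodically to release its handles.
void run_loop_fixed(JavaVM* jvm, jobject callback) {
    JNIEnv* env = 0;
    jvm->AttachCurrentThread((void**)&env, 0);
    jclass cls = env->GetObjectClass(callback);
    jmethodID mid = env->GetMethodID(cls, "tick", "()V");  // cached outside the loop
    env->DeleteLocalRef(cls);
    for (long i = 0; i < 100000000L; ++i) {
        env->CallVoidMethod(callback, mid);
    }
    jvm->DetachCurrentThread();
}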
> > Looking at src/share/vm/runtime/handles.cpp:76, you see the comment > > // The thread local handle areas should not get very large > > However, src/share/vm/prims/jni.cpp:1229 allocates a KlassHandle in > each call to GetMethodID, which seems to leak as it is never freed > from the HandleArea of the thread... > > > Obviously, I can fix my code to cache its jmethodIDs and/or release > threads more often (which I have done and it seems to fix my problem). > But this seems a surprising side effect of GetMethodID()... > > Matt > From peter.schuller at infidyne.com Thu Jun 10 15:51:18 2010 From: peter.schuller at infidyne.com (Peter Schuller) Date: Fri, 11 Jun 2010 00:51:18 +0200 Subject: g1 not doing partial aggressively enough -> fallback to full gc In-Reply-To: References: Message-ID: > regions. Based on the heap size after the full GC, the live working > set is roughly ~ 230 MB, and the maximum heap size is 4 GB. This means > that during the GC:s right before the full GC:s, g1 is choosing to do > young generation GC:s even though the average live ratio in the older > regions should be roughly 5% (low-hanging fruit for partial > collections). 
In any case, the mark+count results in these partials: 111.054: [GC pause (partial) 2896M->2850M(3072M), 0.0464190 secs] 111.144: [GC pause (partial) 2862M->2811M(3072M), 0.0458500 secs] 111.215: [GC pause (partial) 2822M->2773M(3072M), 0.0529640 secs] 111.294: [GC pause (partial) 2784M->2738M(3072M), 0.0502390 secs] 111.374: [GC pause (partial) 2749M->2707M(3072M), 0.0487850 secs] 111.459: [GC pause (partial) 2718M->2677M(3072M), 0.0478110 secs] 111.544: [GC pause (partial) 2687M->2650M(3072M), 0.0495010 secs] 111.640: [GC pause (partial) 2660M->2626M(3072M), 0.0498750 secs] 111.745: [GC pause (partial) 2635M->2601M(3072M), 0.0500490 secs] 111.857: [GC pause (partial) 2609M->2581M(3072M), 0.0471700 secs] 111.943: [GC pause (partial) 2588M->2577M(3072M), 0.0509000 secs] Those numbers will of course be affected by both the young and the old generation, but if I'm reading and interpreting it right, the maximum amount of old-gen data that was collected was roughly 319 MB, assuming no young-gen data was generated (not true, but it's an upper bound). Given that the heap size was reported to be 2130 MB at the start of the initial mark, that leaves >= 2130-319=1811 MB of heap "used", with an actual live set of ~ 281 MB (based on full GC). So, this still leaves me with an actual live set that is <= 15.5% (281/1811) after partial GC:s stopped. So: (1) The 15.5%, upper bound is not consistent with the 34.95% figure. But then those overall stats were not consistent with the reported heap size anyway. Either I blew it again with the patch (wouldn't be surprised), or there is a discrepancy between the used/max_live produced by the g1 mark/count and the heap stats printed. If the latter is the case, is this understood? (2) With an <= 15.5% average utilization you'd definitely want, in an ideal world, a lot more partial GC:ing going on. (3) If I ignore the region usage count which is suspect, the observations still seem consistent with the live set being calculated at card size granularity. Can anyone confirm/deny? -- / Peter Schuller From tony.printezis at oracle.com Mon Jun 21 08:04:38 2010 From: tony.printezis at oracle.com (Tony Printezis) Date: Mon, 21 Jun 2010 11:04:38 -0400 Subject: Anyone using -XX:-UseDepthFirstScavengeOrder? Message-ID: <4C1F7F86.7060602@oracle.com> Hi all, When we did the work to introduce depth-first copying in +UseParallelGC back in JDK 6 we enabled it by default but left the previous copying order in under a flag in case folks wanted to use it (it would be enabled with -XX:-UseDepthFirstScavengeOrder). We would like to remove the old copying order to simplify our code. We don't know anyone who uses -XX:-UseDepthFirstScavengeOrder but we thought we'd check: is anyone still using it? Tony, HS GC Group From peter.schuller at infidyne.com Wed Jun 23 15:11:25 2010 From: peter.schuller at infidyne.com (Peter Schuller) Date: Thu, 24 Jun 2010 00:11:25 +0200 Subject: g1 not doing partial aggressively enough -> fallback to full gc In-Reply-To: References: Message-ID: For the record (= mailing list archives), using a variant of [1] (strictly throw-away, but just FYI) I finally realized that the problem is/was that the predicted remember set scanning cost is extremely high for the regions that are not being selected for partial gc. Individual regions, even those that are almost empty (< 10% liveness) either never reach the top of region candidates, or else are predicted to be so expensive that even a single region can blow away the pause time goal. 
For example:

predict_region_elapsed_time_ms: 26.896303ms total, 26.172088ms rs scan (36045 cnum), 0.056240 copy time (34720 bytes), 0.667975 other time
predict_region_elapsed_time_ms: 59.700384ms total, 58.974873ms rs scan (81222 cnum), 0.057536 copy time (35520 bytes), 0.667975 other time
predict_region_elapsed_time_ms: 79.066806ms total, 78.331835ms rs scan (107881 cnum), 0.066996 copy time (41360 bytes), 0.667975 other time
predict_region_elapsed_time_ms: 76.376391ms total, 75.619144ms rs scan (104145 cnum), 0.089272 copy time (55112 bytes), 0.667975 other time

What I don't understand now is why there is an accumulation of cards to be scanned in the remembered set that is so large. Reading the g1 paper I get the impression that mutator threads are supposed to do rset scanning when the global queue becomes full (and that the size is not huge), and normally for non-hot cards that a concurrent rs scanning thread will do the scanning work. (Code-wise I have not yet figured out whether there even is a dedicated remembered set scanning thread anymore though.) In this case I have accumulations of > 100 000 cards to be scanned. That's pretty significant. Presumably (though I have not looked at this in detail yet) these remembered sets remain large (for whatever reason) or else the regions would be collected eventually within some reasonable time. Is there a dedicated thread which is just not catching up with the mutators (in which case one might want it to prioritize low-liveness regions), or is it the case that there is only mutator + gc rs scanning in the current g1? If the latter, and the mutators don't trigger rs scanning for these regions while the regions are never picked for collection for efficiency reasons, then the regions might, it seems to me, essentially be un-collectable forever.

[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch

-- / Peter Schuller
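For readers following the predict_region_elapsed_time_ms output above: the prediction decomposes into a remembered-set scan term that scales with the card count, a copy term that scales with the bytes to copy, and a constant "other" term. The sketch below is only a back-of-the-envelope model fitted to the four samples quoted in this message, not the actual HotSpot prediction code.

// Rough per-region cost model implied by the log lines above; the rates are
// derived from the first sample (26.172088 ms for 36045 cards, 0.056240 ms
// for 34720 bytes) and are illustrative only.
double predict_region_ms(double rs_cards, double live_bytes) {
    const double ms_per_card = 26.172088 / 36045.0;   // ~0.00073 ms per card
    const double ms_per_byte = 0.056240 / 34720.0;    // ~1.6e-6 ms per byte
    const double other_ms    = 0.667975;              // constant across all four samples
    return rs_cards * ms_per_card + live_bytes * ms_per_byte + other_ms;
}
// At around 100 000 cards the rs scan term alone is roughly 75 ms, which is
// why a single such region can blow through a typical pause time goal.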