CMS vs G1 - Scan RS very long

Michal Frajt michal at frajt.eu
Thu Feb 21 10:13:09 UTC 2013


Hi John,

The coarsening was indeed causing the very long RSet scan times. We are now printing the summarized RSet stats every tenth young collection. Increasing G1RSetRegionEntries to 512 has reduced the RSet scan time, but the G1 young collection is still 2.5x slower than ParNew. We have additionally reduced the minimum NewSize to 32m to be able to stay within the MaxGCPauseMillis target of 60ms.

Updated results
ParNew - eden 94M (not adapting) - <60ms 
G1 - eden around 36M (adapting) - ~60ms

Obviously the eden size determines how often the young collection is invoked; frequent young collections break object aging and lead to unwanted promotions.
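(A rough illustration, assuming a comparable allocation rate on both nodes: a 36M eden fills about 2.6x faster than a 94M one, so an object reaches the MaxTenuringThreshold of 15 survived collections in roughly 2.6x less wall-clock time, and medium-lived objects that would have died in the larger eden instead get copied through the survivors and promoted.)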

We are still not able to make the G1 young collection run any closer to ParNew. Please find attached the new GC log and jinfo files.

Regards,
Michal

Concurrent RS processed 355044407 cards
  Of 1567763 completed buffers:
      1567600 (100.0%) by conc RS threads.
          163 (  0.0%) by mutator threads.
  Conc RS threads times(s)
         1049.36    587.09    170.71    162.47    161.34    159.19    158.13    157.45    15.93
  Total heap region rem set sizes = 103712K.  Max = 299K.
  Static structures = 556K, free_lists = 175K.
    24045190 occupied cards represented.
    Max size region = 836:(O)[0xfffffd7e2de00000,0xfffffd7e2e000000,0xfffffd7e2e000000], size = 300K, occupied = 34K.
    Did 779 coarsenings.
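(Two things stand out in the summary above: the 779 coarsenings are entries whose referencing regions must now be scanned in full during Scan RS, and the skewed concurrent refinement thread times - 1049s for the first thread down to 16s for the last - are expected, since refinement threads are only activated one by one as the update buffer backlog grows.)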

2013-02-21T10:29:35.262+0100: 12476.751: [GC pause (young) 12476.752: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 5104, predicted base time: 14.92 ms, remaining time: 45.08 ms, target pause time: 60.00 ms]
 12476.752: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 17 regions, survivors: 3 regions, predicted young region time: 41.22 ms]
 12476.752: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 17 regions, survivors: 3 regions, old: 0 regions, predicted pause time: 56.14 ms, target pause time: 60.00 ms]
, 0.0626134 secs]
   [Parallel Time: 57.3 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 12476751.8, Avg: 12476751.9, Max: 12476752.0, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 3.8, Avg: 4.4, Max: 5.1, Diff: 1.3, Sum: 35.5]
      [Update RS (ms): Min: 4.9, Avg: 5.8, Max: 6.7, Diff: 1.7, Sum: 46.4]
         [Processed Buffers: Min: 2, Avg: 5.9, Max: 16, Diff: 14, Sum: 47]
      [Scan RS (ms): Min: 33.2, Avg: 33.5, Max: 33.9, Diff: 0.6, Sum: 268.4]
      [Object Copy (ms): Min: 13.0, Avg: 13.2, Max: 13.2, Diff: 0.2, Sum: 105.2]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.0, Sum: 0.5]
      [GC Worker Total (ms): Min: 56.9, Avg: 57.0, Max: 57.1, Diff: 0.2, Sum: 456.1]
      [GC Worker End (ms): Min: 12476808.9, Avg: 12476808.9, Max: 12476808.9, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Clear CT: 1.0 ms]
   [Other: 4.3 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.2 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 0.9 ms]
   [Eden: 34.0M(34.0M)->0.0B(26.0M) Survivors: 6144.0K->6144.0K Heap: 2168.4M(8776.0M)->2134.6M(8776.0M)]
 [Times: user=0.47 sys=0.01, real=0.06 secs]
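(Reading the pause above: Scan RS still averages 33.5 ms of the 57.3 ms parallel phase - nearly 60% of the pause - while Object Copy takes only 13.2 ms, so even with G1RSetRegionEntries=512 the remembered set scan dominates the young collection.)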

G1 parameters
-XX:+IgnoreUnrecognizedVMOptions -XX:+AggressiveOpts -XX:-OmitStackTraceInFastThrow -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=16 -XX:MaxInlineSize=256 -XX:+UnlockDiagnosticVMOptions -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=10 -Xmx8896m -Xms2048m -XX:NewSize=32m -XX:PermSize=128m -XX:SurvivorRatio=5 -XX:MaxTenuringThreshold=15 -XX:+UseG1GC -XX:MaxGCPauseMillis=60 -XX:G1HeapRegionSize=2m -XX:G1MixedGCCountTarget=16 -XX:G1HeapWastePercent=10 -XX:G1RSetRegionEntries=512 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 

CMS parameters
-XX:+IgnoreUnrecognizedVMOptions -XX:+AggressiveOpts -XX:-OmitStackTraceInFastThrow -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=16 -XX:MaxInlineSize=256 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+TraceClassUnloading -XX:+PrintHeapAtGC -XX:PrintCMSStatistics=1 -XX:+PrintCMSInitiationStatistics -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -Xmx8884m -Xms2048m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:PermSize=128m -XX:SurvivorRatio=5 -XX:MaxTenuringThreshold=15 -XX:CMSMarkStackSize=8M -XX:CMSMarkStackSizeMax=32M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:CMSWaitDuration=60000 -XX:+CMSScavengeBeforeRemark -XX:CMSTriggerInterval=600000 -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:ParallelCMSThreads=2





Od: "John Cuthbertson" john.cuthbertson at oracle.com
Komu: "Michal Frajt" michal at frajt.eu
Kopie: monica.beckwith at oracle.com, hotspot-gc-dev at openjdk.java.net
Datum: Wed, 06 Feb 2013 12:05:11 -0800
Předmet: Re: CMS vs G1 - Scan RS very long


Hi Michal,

Apologies for the delay.

You can confirm whether RSet coarsening is the cause of your long RSet
scan times using the flags:

-XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=n

Basically every n-th GC you'll get a summary of the RSet activity -
including total size, region with the largest RSet, concurrent
activity, and the number of coarsenings. These are diagnostic flags so
you'll need -XX:+UnlockDiagnosticVMOptions.
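For example, to print the summary every tenth GC (the combination used for the stats at the top of this mail):

-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=10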
If the number of coarsenings is high then you can increase the number of
entries in the fine-grain RSet table by setting G1RSetRegionEntries.
By default this gets set ergonomically using a base number (256) per Mb
of region size, so for the region size in your log it will be set to
256. This means that we can hold references from 256 distinct regions
in each fine grain table before we start coarsening references from
regions. This equates to 12.5% of your heap. I would suggest bumping
that value up to 512.
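(Spelling out the arithmetic behind that 12.5%, assuming the 1Mb region size these numbers imply: -Xms2048m gives 2048 regions, so 256 fine grain entries per RSet cover 256/2048 = 12.5% of the regions before coarsening starts evicting.)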
Alternatively you can also try changing the heap region size by
explicitly setting -XX:G1HeapRegionSize=2m. In addition to increasing
the capacity of the fine grain tables, this should reduce the number of
entries in the RSets since there should be fewer cross-region
references.
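(The two effects compound: with the ergonomic base of 256 entries per Mb of region size, 2m regions double each fine grain table's capacity to 512 entries while halving the total region count, so each table covers roughly four times the fraction of the heap before coarsening kicks in.)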
Regards,

JohnC
On 02/01/13 09:15, Michal Frajt wrote:

> Monica,
> 
> Today morning I sent an email with the correct log files. The attachment size was over the limit, it is still waiting for approval to be added to the posting list.
> 
> Some points:
> - we compare G1 with CMS-ParNew (not iCMS, even though we are using iCMS everywhere else)
> - currently we evaluate the young gen processing only
> - now we set the G1RSetRegionEntries to 1024, it reduced the Scan RS time but it is still 3 times slower than ParNew
> - we set NewSize to 128m for both the G1 and ParNew to keep the young collection invoked on the same amount of garbage
> - setting G1 NewSize to 16GB and the target time to 60ms is able to keep track with the ParNew, but the G1 young gen is invoked 3 times more frequently as the adapted Eden size is around 30MB (ParNew has 128MB minus survivor spaces)
> - I will be able to provide you the new log files and more details on Tuesday
> 
> Thanks,
> Michal
> 
> From: Monica Beckwith [mailto:monica.beckwith at oracle.com]
> Sent: Friday, February 01, 2013 4:39 PM
> To: Michal Frajt
> Cc: hotspot-gc-dev at openjdk.java.net; John Cuthbertson
> Subject: Re: CMS vs G1 - Scan RS very long
> 
> Thanks, Michal for providing us with the logs. John (Cuthbertson) and I had a great discussion yesterday pertaining to your G1 log; we think we understand what's going on with your workload and have an experiment plan for you. John is going to send more details (with explanation). Your workload's pattern is very interesting to me/us and I have started a "work folder" for it, so that I can document our experiment plan and the results and keep track of the progress.
> 
> Thanks for sharing your logs!
> 
> -Monica
> 
> On 2/1/2013 2:12 AM, Michal Frajt wrote:
> 
> > Sorry, forgot to attach the archive in the last message.
> > 
> > From: hotspot-gc-dev-bounces at openjdk.java.net
> > To: monica.beckwith at oracle.com
> > Cc: hotspot-gc-dev at openjdk.java.net
> > Date: Fri, 1 Feb 2013 09:05:58 +0100
> > Subject: Re: CMS vs G1 - Scan RS very long
> > Hi Monica / John,
> > 
> > Find the new archive which correctly contains both the CMS and G1 log files. Yesterday I somehow managed to overwrite the CMS log file when creating the tar archive.
> > 
> > Node2 - running CMS
> > mmreutersservernode2_20130131105521_gc.log
> > mmreutersservernode2_jinfo.txt
> > 
> > Node3 - running G1
> > mmreutersservernode3_20130131105525_gc.log
> > mmreutersservernode3_jinfo.txt
> > 
> > Regards,
> > Michal
  
  > 
  > > Od: "Monica Beckwith" monica.beckwith at oracle.com
  
  > 
  > > Komu: "John Cuthbertson" john.cuthbertson at oracle.com
  
  > 
  > > Kopie: hotspot-gc-dev at openjdk.java.net,"Michal
Frajt" michal at frajt.eu
  
  > 
  > > Datum: Thu, 31 Jan 2013 12:44:46 -0600
  
  > 
  > > Předmet: Re: CMS vs G1 - Scan RS very long
  
> > > Hi Michal (and John),
> > > 
> > > I did look at Michal's log files (BTW, both are the same G1 logs). And I can confirm that scan RS is the issue... here's the plot:
> > > 
> > > [plot attachment: max Object Copy times, max Scan RS times, and Parallel Time per pause]
> > > 
> > > The plot above only shows the max obj copy times (for all pauses), max RS Scan times and Parallel Time. So yes, scan RS is the culprit.
> > > 
> > > Also, looking at your logs, it seems that for mixed GCs the reclaimable bytes don't cross 6%. So can you please try increasing the G1HeapWastePercent to 10? (the latest builds will have 10 as the default value).
> > > 
> > > Please let me know if that improves your RT.
> > > 
> > > -Monica
> > > 
> > > On 1/31/2013 11:56 AM, John Cuthbertson wrote:
> > > 
> > > Hi Michal,
> > > 
> > > I haven't looked at the logs yet but from your description it sounds like a large number of the RSets have been coarsened and/or the fine grain tables have gotten very dense.
> > > 
> > > The RSet for a heap region tracks incoming references to objects in that region. There are three levels of granularity: sparse, fine, and coarse. The sparse and fine entries are encoded in open hash tables based upon the region containing the references. As the number of references from region A that point into region B increases, the number of cards in the hash table entry for region A increases (it's actually a bitmap with one bit per card), and as the number of regions that contain references pointing into B increases, the number of region entries in the fine grain table increases. Once we run out of space in the fine grain table (i.e. we can't add another bitmap for another region) we evict one of the densest region bitmaps and, so to speak, "coarsen" that entry. When we have a coarsened entry we have to search the entire referencing region looking for incoming references, compared to searching specific cards with the sparse and fine entries.
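As an aside, here is a minimal Java sketch of the structure John describes - purely illustrative, with invented names (RSetSketch, add, cardsToScan), nothing like HotSpot's actual concurrent implementation: a handful of card indexes per referencing region (sparse), promotion to a per-region card bitmap (fine), and eviction of the densest bitmap to a coarse set once the fine table is full:

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy remembered set for one heap region, with the three granularity levels.
    class RSetSketch {
        static final int CARDS_PER_REGION = 2048;  // e.g. 1 Mb region / 512-byte cards
        static final int SPARSE_CARD_LIMIT = 4;    // promote to fine beyond this
        final int maxFineEntries;                  // cf. G1RSetRegionEntries

        final Map<Integer, Set<Integer>> sparse = new HashMap<>(); // region -> few card indexes
        final Map<Integer, BitSet> fine = new HashMap<>();         // region -> card bitmap
        final Set<Integer> coarse = new HashSet<>();               // regions scanned in full

        RSetSketch(int maxFineEntries) { this.maxFineEntries = maxFineEntries; }

        // Record "some card in fromRegion holds a reference into this region".
        void add(int fromRegion, int card) {
            if (coarse.contains(fromRegion)) return;   // already scanning whole region
            BitSet bm = fine.get(fromRegion);
            if (bm != null) { bm.set(card); return; }
            Set<Integer> cards = sparse.computeIfAbsent(fromRegion, r -> new HashSet<Integer>());
            cards.add(card);
            if (cards.size() > SPARSE_CARD_LIMIT) {    // sparse entry overflows: promote
                sparse.remove(fromRegion);
                if (fine.size() == maxFineEntries) coarsenDensest();
                BitSet promoted = new BitSet(CARDS_PER_REGION);
                for (int c : cards) promoted.set(c);
                fine.put(fromRegion, promoted);
            }
        }

        // Fine table is full: evict the densest bitmap; that region must now
        // be searched in its entirety during Scan RS.
        void coarsenDensest() {
            int densest = -1, most = -1;
            for (Map.Entry<Integer, BitSet> e : fine.entrySet()) {
                if (e.getValue().cardinality() > most) {
                    most = e.getValue().cardinality();
                    densest = e.getKey();
                }
            }
            fine.remove(densest);
            coarse.add(densest);
        }

        // Cards Scan RS must visit for this region.
        long cardsToScan() {
            long n = (long) coarse.size() * CARDS_PER_REGION;
            for (BitSet bm : fine.values()) n += bm.cardinality();
            for (Set<Integer> s : sparse.values()) n += s.size();
            return n;
        }
    }

The cardsToScan() total is the punchline: every coarsened entry costs a full region's worth of cards (2048 in this sketch) no matter how few incoming references it really has, which is how 779 coarsenings turn into Scan RS dominating the pause.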
> > > 
> > > I'll take a look at your logs to see if I can confirm.
> > > 
> > > JohnC
> > > 
> > > On 1/31/2013 6:12 AM, Michal Frajt wrote:
> > > 
> > > Hi all,
> > > 
> > > After the iCMS got officially deprecated we decided to compare the G1 collector with our best tuned (i)CMS setup. Unfortunately we are not able to make the G1 young collection run any closer to the ParNew. Actually we wanted to compare the G1 concurrent marking STW pauses with the CMS initial-mark and remark STW pauses, but the already incredibly long running G1 young collections are unacceptable for us.
> > > 
> > > We were able to recognize that the very long G1 young collections are caused by scanning the remembered sets. There is not much documentation about G1 internals, but we were able to understand that the size of the remembered sets is related to the amount of mutating references from old regions (cards) to young regions. Unfortunately all our applications permanently mutate thousands of references from old objects to young objects.
> > > 
> > > We are testing with the latest OpenJDK7u extended by the 7189971 patch and the CMSTriggerInterval implementation. The attached GC log files represent two nearly identical applications processing very similar data sets, one running the G1 and the second running the CMS collector. The OpenJDK7u has an extra output of _pending_cards (when G1TraceConcRefinement is activated) which somehow relates to the remembered set size.
> > > 
> > > Young Comparison (both 128m, survivor ratio 5, max tenuring 15)
> > > CMS - invoked every ~20 sec, avg. stop 60ms
> > > G1 - invoked every ~16 sec, avg. stop 410ms !!!
> > > 
> > > Is there anything that could help us to reduce the Scan RS time, or is the G1 simply not targeted at applications heavily mutating old region objects?
> > > 
> > > CMS parameters
> > > -Xmx8884m -Xms2048m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:PermSize=128m -XX:SurvivorRatio=5 -XX:MaxTenuringThreshold=15 -XX:CMSMarkStackSize=8M -XX:CMSMarkStackSizeMax=32M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:CMSWaitDuration=60000 -XX:+CMSScavengeBeforeRemark -XX:CMSTriggerInterval=600000 -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:ParallelCMSThreads=2
> > > 
> > > G1 parameters (mind MaxNewSize not specified)
> > > -Xmx8884m -Xms2048m -XX:NewSize=128m -XX:PermSize=128m -XX:SurvivorRatio=5 -XX:MaxTenuringThreshold=15 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1MixedGCCountTarget=16 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2
> > > 
> > > G1 log file GC young pause
> > > [GC pause (young) [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 23697, predicted base time: 32.88 ms, remaining time: 167.12 ms, target pause time: 200.00 ms]
> > > [Parallel Time: 389.8 ms, GC Workers: 8]
> > >      >>>>
> > >      [Scan RS (ms): Min: 328.8, Avg: 330.4, Max: 332.6, Diff: 3.8, Sum: 2642.9]
> > >      <<<<
> > > [Eden: 119.0M(119.0M)->0.0B(118.0M) Survivors: 9216.0K->10.0M Heap: 1801.6M(2048.0M)->1685.7M(2048.0M)]
> > > 
> > > Regards,
> > > Michal
> > > 
> > > -- 
> > > Monica Beckwith | Java Performance Engineer
> > > VOIP: +1 512 401 1274
> > > Texas
> > > Oracle is committed to developing practices and products that help protect the environment