Bug in G1 logging? incorrect stop times?

Fri Oct 3 09:03:24 UTC 2014

On Thursday 02 October 2014 16.34.54 Bengt Rutisson wrote:
> On 2014-10-02 15:07, charlie hunt wrote:
> > Hi Thomas,
> > 
> > Don't mean to jump in the middle of your discussion with Bengt ...
> > 
> > Since this is M5, have you tried setting the FX Solaris CPU scheduler,
> > (based on the hypothesis this may be thread scheduling related)?
> 
> Good point, Charlie. My guess is that there is some kind of scheduling
> issue. 2200 threads are quite a few threads to start and stop...
> 
> Thomas, have you tried with ParallelGC to see if you happen to get the
> same issue there. If it is a scheduling issue for safepoints I don't
> think it should be GC dependent.

Part of that information should be available in the logs since Thomas was 
running with -XX:+PrintSafepointStatistics which provides a breakdown of the 
timings of different phases of a safepoint operation.
Unfortunately, the default behavior for the flag is to summarize a number of 
safepoint ops regularly, so you may need to look through a large part of the 
log to find the safepoint corresponding to the problematic GC cycle.

The output should look something like:
$ java -XX:+PrintSafepointStatistics
         vmop                    [threads: total initially_running 
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.117: ParallelGCFailedAllocation       [       8          0              0    
]      [     0     0     0     0     1    ]  0   
0.144: no vm operation                  [       6          0              0    
]      [     0     0     0     0     0    ]  0   

/Mikael

> 
> Bengt
> 
> > hths,
> > 
> > Charlie
> > 
> > 
> > 
> > Sent from my iPhone
> > 
> > On Oct 2, 2014, at 5:35 AM, Thomas Viessmann
> > 
> > <thomas.viessmann at oracle.com <mailto:thomas.viessmann at oracle.com>> wrote:
> >> Hi Bengt,
> >> 
> >> Better late than never :-) .
> >> There are 2200 threads, about 10% are runnable.
> >> The application is the only one running on this M5.
> >> 
> >> We observe this effect quite often with G1. Relatively short
> >> GC times and long application thread stop times.
> >> Interestingly I've never seen this with CMS. But
> >> CMS is not an option here due to fragmentation.
> >> 
> >> Thanks and Regards
> >> 
> >> Thomas
> >> 
> >> On 10/02/14 11:15, Bengt Rutisson wrote:
> >>> Hi Thomas,
> >>> 
> >>> A very late reply. I happened to find this in my inbox and have some
> >>> comments.
> >>> 
> >>> On 2014-08-29 14:15, Thomas Viessmann wrote:
> >>>> Hi all,
> >>>> 
> >>>> 
> >>>> I suspect a bug in G1 logging. Before I file a (possibly incorrect)
> >>>> bug I would like to ask here.
> >>>> Is my understanding correct that the stw evacuation phase starts at
> >>>> *48448.931* and ends at *48454**.034* ?
> >>>> This would mean a stop time of 5.103 s, which is clearly a mismatch
> >>>> to the reported *9.3426141* s
> >>>> If the stw phase in fact started at a later time e.g. at *48449.288
> >>>> *then the stop pause would have been even shorter.
> >>>> 
> >>>> In addition I do not understand how all the relatively short
> >>>> evacuation phases sum up to several seconds.
> >>>> If I add all the MAX values *69.1 + 22.2 + 45+ 150.8 + 0.1 + 155.1
> >>>> + 0.9 + 0.5**= 443.7**ms*
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> 48443.380: Total time for which application threads were stopped:
> >>>> 0.0084701 seconds
> >>>> *48448.931*: [GC pause (mixed) 48448.931: [G1Ergonomics (CSet
> >>>> Construction) start choosing CSet, _pending_cards: 89458, predicted
> >>>> base time: 73.58 ms, remaining time: 176.42 ms, target pause time:
> >>>> 250.00 ms]
> >>>> 
> >>>>  48448.931: [G1Ergonomics (CSet Construction) add young regions to
> >>>> 
> >>>> CSet, eden: 33 regions, survivors: 5 regions, predicted young
> >>>> region time: 35.73 ms]
> >>>> 
> >>>>  48448.933: [G1Ergonomics (CSet Construction) finish adding old
> >>>> 
> >>>> regions to CSet, reason: reclaimable percentage not over threshold,
> >>>> old: 19 regions, max: 77 regions, reclaimable: 2570649712 bytes
> >>>> (9.98 %), threshold: 10.00 %]
> >>>> 
> >>>>  48448.933: [G1Ergonomics (CSet Construction) added expensive
> >>>> 
> >>>> regions to CSet, reason: old CSet region num not reached min, old:
> >>>> 19 regions, expensive: 7 regions, min: 26 regions, remaining time:
> >>>> 0.00 ms]
> >>>> 
> >>>>  48448.933: [G1Ergonomics (CSet Construction) finish choosing CSet,
> >>>> 
> >>>> eden: 33 regions, survivors: 5 regions, old: 19 regions, predicted
> >>>> pause time: 340.58 ms, target pause time: 250.00 ms]
> >>>> 48449.252: [SoftReference, 962 refs, 0.0064679 secs]48449.259:
> >>>> [WeakReference, 7411 refs, 0.0027722 secs]48449.261:
> >>>> [FinalReference, 272 refs, 0.0022754 secs]48449.264:
> >>>> [PhantomReference, 6 refs, 0.0024348 secs]48449.266: [JNI Weak
> >>>> Reference, 0.0000337 secs] 48449.288: [G1Ergonomics (Concurrent
> >>>> Cycles) do not request concurrent cycle initiation, reason: still
> >>>> doing mixed collections, occupancy: 14898167808 bytes, allocation
> >>>> request: 0 bytes, threshold: 11596411665 bytes (45.00 %), source:
> >>>> end of GC]
> >>>> *48449.288*: [G1Ergonomics (Mixed GCs) do not continue mixed GCs,
> >>>> reason: reclaimable percentage not over threshold, candidate old
> >>>> regions: 156 regions, reclaimable: 2570649712 bytes (9.98 %),
> >>>> threshold: 10.00 %]
> >>>> , 0.3573636 secs]
> >>>> 
> >>>>    [Parallel Time: 314.9 ms, GC Workers: 40]
> >>>>    
> >>>>       [GC Worker Start (ms): Min: 48448933.6, Avg: 48448934.2, Max:
> >>>> 48448934.7, Diff: 1.1]
> >>>> 
> >>>>       [Ext Root Scanning (ms): Min: 15.1, Avg: 17.7, Max: *69.1*,
> >>>> 
> >>>> Diff: 54.0, Sum: 708.0]
> >>>> 
> >>>>       [Update RS (ms): Min: 0.0, Avg: 14.5, Max: *22.2*, Diff:
> >>>> 22.2, Sum: 580.3]
> >>>> 
> >>>>          [Processed Buffers: Min: 0, Avg: 22.0, Max: *45*, Diff:
> >>>> 45, Sum: 879]
> >>>> 
> >>>>       [Scan RS (ms): Min: 88.7, Avg: 142.6, Max: *150.8*, Diff:
> >>>> 62.1, Sum: 5705.2]
> >>>> 
> >>>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: *0.1*,
> >>>> 
> >>>> Diff: 0.1, Sum: 0.1]
> >>>> 
> >>>>       [Object Copy (ms): Min: 131.4, Avg: 138.2, Max: *155.1*,
> >>>> 
> >>>> Diff: 23.8, Sum: 5526.6]
> >>>> 
> >>>>       [Termination (ms): Min: 0.0, Avg: 0.7, Max: *0.9*, Diff: 0.9,
> >>>> 
> >>>> Sum: 27.2]
> >>>> 
> >>>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.3, Max: *0.5*, Diff:
> >>>> 0.5, Sum: 10.0]
> >>>> 
> >>>>       [GC Worker Total (ms): Min: 313.2, Avg: 313.9, Max: *314.7*,
> >>>> 
> >>>> Diff: 1.5, Sum: 12557.5]
> >>>> 
> >>>>       [GC Worker End (ms): Min: 48449247.9, Avg: 48449248.1, Max:
> >>>> 48449248.4, Diff: 0.5]
> >>>> 
> >>>>    [Code Root Fixup: 2.2 ms]
> >>>>    [Code Root Migration: 0.2 ms]
> >>>>    [Clear CT: 2.0 ms]
> >>>>    [Other: 38.2 ms]
> >>>>    
> >>>>       [Choose CSet: 1.7 ms]
> >>>>       [Ref Proc: 15.6 ms]
> >>>>       [Ref Enq: 1.2 ms]
> >>>>       [Free CSet: 2.4 ms]
> >>>>    
> >>>>    [Eden: 1056.0M(1056.0M)->0.0B(1088.0M) Survivors: 160.0M->128.0M
> >>>> 
> >>>> Heap: 14.5G(24.0G)->13.9G(24.0G)]
> >>>> 
> >>>>  [Times: user=12.45 sys=0.03, real=0.36 secs]
> >>>> 
> >>>> *48454**.034*: Total time for which application threads were
> >>>> stopped: *9.3426141* seconds
> >>>> 
> >>>> For completeness; here are the cmd line options:
> >>>> 
> >>>> 
> >>>> /usr/jdk/instances/jdk1.7.0/bin/sparcv9/java -Xms24g -Xmx24g
> >>>> -server  -XX:StringTableSize=27001 -XX:PermSize=512m
> >>>> -XX:MaxPermSize=512m  -Xss512k
> >>>> -XX:+UseG1GC  -XX:SoftRefLRUPolicyMSPerMB=500 -XX:+PrintGCDetails
> >>>> -XX:+PrintGCTimeStamps
> >>>> -Xloggc:/var/cacao/instances/oem-ec/logs/gc.log
> >>>> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
> >>>> -XX:GCLogFileSize=10m  -XX:+PrintGCApplicationStoppedTime
> >>>> -XX:MaxGCPauseMillis=250-XX:InitiatingHeapOccupancyPercent=45
> >>>> -XX:+ParallelRefProcEnabled  -XX:ParallelGCThreads=40
> >>>> -XX:+PrintAdaptiveSizePolicy -XX:+PrintSafepointStatistics
> >>>> -XX:+PrintReferenceGC -XX:G1HeapRegionSize=32m
> >>>> -XX:HeapBaseMinAddress=2G
> >>> 
> >>> The GC starts at 48448.931 and reports that it lasts for 0.36
> >>> seconds, so it should have ended at 48449.291. The last log line
> >>> that states that the threads were stopped for 9 seconds is from the
> >>> PrintGCApplicationStoppedTime flag, which logs how long the
> >>> safepoint was. i.e. how long the Java threads where stopped.
> >>> 
> >>> This means that the safepoint lasted for 9 seconds but the GC only
> >>> lasted for 0.36 seconds. So the rest of the time must be attributed
> >>> to stopping and starting the Java threads. The safepoint was started
> >>> at 48444.691 (48454.034 - 9.3426141), which means that it took 4.240
> >>> seconds (48448.931 - 48444.691) to stop all Java threads and 4.743
> >>> seconds (9.3426141 - 4.240 - 0.36) to start them again.
> >>> 
> >>> I don't have a good explanation for why it takes so long to stop and
> >>> start the threads. Are there many Java threads in the application?
> >>> Are there many other processes running that could cause the Java
> >>> threads to be scheduled out?
> >>> 
> >>> Thanks,
> >>> Bengt
> >>> 
> >>>> Thanks and Regards
> >>>> 
> >>>> Thomas
> >>>> 
> >>>> 
> >>>> ORACLE Deutschland B.V. & Co. KG | Riesstr.25 | D-80992 Muenchen
> >>>> 
> >>>> ORACLE Deutschland B.V. & Co. KG
> >>>> Hauptverwaltung: Riesstr. 25, D-80992 Muenchen
> >>>> Registergericht: Amtsgericht Muenchen, HRA 95603
> >>>> Geschäftsführere: Juergen Kunz
> >>>> 
> >>>> Komplementärin: ORACLE Deutschland Verwaltung B.V.
> >>>> Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
> >>>> Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
> >>>> Geschäftsführer: Alexander van der Ven, Astrid Kepper, Val Maher
> >>>> 
> >>>> -----------------------------------------------------------------------
> >>>> -
> >>>> -----------------------------------------------------------------------
> >>>> -
> >>>> <mime-attachment.gif> <http://www.oracle.com/commitment> Oracle is
> >>>> committed to developing practices and products that help protect
> >>>> the environment
> >> 
> >> ORACLE Deutschland B.V. & Co. KG | Riesstr.25 | D-80992 Muenchen
> >> 
> >> ORACLE Deutschland B.V. & Co. KG
> >> Hauptverwaltung: Riesstr. 25, D-80992 Muenchen
> >> Registergericht: Amtsgericht Muenchen, HRA 95603
> >> Geschäftsführere: Juergen Kunz
> >> 
> >> Komplementärin: ORACLE Deutschland Verwaltung B.V.
> >> Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
> >> Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
> >> Geschäftsführer: Alexander van der Ven, Astrid Kepper, Val Maher
> >> 
> >> ------------------------------------------------------------------------
> >> ------------------------------------------------------------------------
> >> <green-for-email-sig_0.gif> <http://www.oracle.com/commitment> Oracle
> >> is committed to developing practices and products that help protect
> >> the environment