Review request: 7072527 CMS: JMM GC counters overcount in some cases
Kevin Walls
kevin.walls at oracle.com
Fri Jul 29 16:35:52 UTC 2011
Hi --
I got this ready in a webrev:
http://cr.openjdk.java.net/~kevinw/7072527/webrev.00/
The new test is passing, as are the old ones. 8-)
Also, this change means that a CMS cycle ending in concurrent mode
failure now counts as one collection. One could argue that either way
(should it be 2 collections?) - but I'm thinking if we are counting
"completed" collections then we are now counting correctly!
Thanks
Kevin
On 29/07/11 02:08, Y. S. Ramakrishna wrote:
> I filed: 7072527 CMS: JMM GC counters overcount in some cases
>
> On 07/28/11 17:29, Y. S. Ramakrishna wrote:
>> Hi Kevin --
>>
>> thanks for jumping on this! More inline below ...
>>
>> On 07/28/11 09:33, Krystal Mok wrote:
>>> Hi Kevin,
>>>
>>> Thank you for taking care of this, and it's good to see the problem
>>> is verified.
>>>
>>> I think whether or not the suggested fix is sufficient depends on
>>> what paths can reach CMSCollector::do_compaction_work(). If all
>>> paths that can reach CMSCollector::do_compaction_work() come from
>>> GenCollectedHeap::do_collection(), then the fix should be good to
>>> go. Otherwise it'll need a better workaround.
>>>
>>> I believe all concurrent mode failures/interrupts (which includes
>>> the System.gc() case) does come from
>>> GenCollectedHeap::do_collection(), but I'm not exactly sure about
>>> this, could anybody please clarify on it?
>>
>> Yes, i believe this is indeed the case, and my browsing of the
>> code using cscope seemed to confirm that belief.
>>
>> More below ...
>>
>>>
>>> Regards,
>>> Kris Mok
>>>
>>> On Thu, Jul 28, 2011 at 10:12 PM, Kevin Walls
>>> <kevin.walls at oracle.com <mailto:kevin.walls at oracle.com>> wrote:
>>>
>>> __
>>> Hi --
>>>
>>> 6580448 was marked as a duplicate of 6581734, which fixed the fact
>>> that CMS collections were just not counted at all - with CMS,
>>> only a
>>> stop the world full gc would be counted in the stats.
>>>
>>> But looks like you're right... Here is a quick variation of the
>>> testcase from 6581734 which shows the same thing, and this verifies
>>> the same, and is solved by ExplicitGCInvokesConcurrent. If
>>> there is
>>> no other feedback I can test if the removal of the
>>> TraceCMSMemoryManagerStats() call in
>>> CMSCollector::do_compaction_work is all we need...
>>
>>
>> Kevin, yes, it would be great if you could verify this and push the fix.
>> I am not sure if the push would need to wait for the signing of OCA
>> from Kris, but best to check with Those Who Would Know Such Things.
>>
>> Since the original CR has been closed, i'll open one momentarily and
>> can make you RE (if that's OK with you). I'll be happy to serve as
>> reviewer of the change.
>>
>> As regards the jstat counter reporting two pauses per concurrent
>> CMS cycle, I am of two minds on what the original intention
>> was. I'd have originally regarded the double increment as a
>> bug, but as you state it is really two pauses, even if part of
>> a single cycle. And it makes sense to count them as two. I
>> agree that this should be documented and left alone, given
>> how long we have had this behaviour, and the alternative
>> (of counting cycles, rather than pauses) may be no better
>> (or arguably worse). There's actually an open CR for this
>> which we can redirect into a CR to update the relevant documentation.
>>
>> -- ramki
>>
>>>
>>> Regards
>>> Kevin
>>>
>>>
>>> /*
>>> * Copyright (c) 2011, Oracle and/or its affiliates. All rights
>>> reserved.
>>> * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
>>> *
>>> * This code is free software; you can redistribute it and/or
>>> modify it
>>> * under the terms of the GNU General Public License version 2
>>> only, as
>>> * published by the Free Software Foundation.
>>> *
>>> * This code is distributed in the hope that it will be useful, but
>>> WITHOUT
>>> * ANY WARRANTY; without even the implied warranty of
>>> MERCHANTABILITY or
>>> * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
>>> License
>>> * version 2 for more details (a copy is included in the LICENSE
>>> file that
>>> * accompanied this code).
>>> *
>>> * You should have received a copy of the GNU General Public
>>> License
>>> version
>>> * 2 along with this work; if not, write to the Free Software
>>> Foundation,
>>> * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
>>> *
>>> * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA
>>> 94065 USA
>>> * or visit www.oracle.com <http://www.oracle.com> if you need
>>> additional information or have any
>>> * questions.
>>> */
>>>
>>> /*
>>> * @test TestFullGCount.java
>>> * @bug
>>> * @summary
>>> * @run main/othervm -XX:+UseConcMarkSweepGC TestFullGCCount
>>> *
>>> */
>>> import java.util.*;
>>> import java.lang.management.*;
>>>
>>>
>>>
>>> public class TestFullGCCount {
>>>
>>> private String poolName = "CMS";
>>> private String collectorName = "ConcurrentMarkSweep";
>>>
>>> public static void main(String [] args) {
>>>
>>> TestFullGCCount t = null;
>>> if (args.length==2) {
>>> t = new TestFullGCCount(args[0], args[1]);
>>> } else {
>>> System.out.println("Defaulting to monitor CMS pool and
>>> collector.");
>>> t = new TestFullGCCount();
>>> }
>>> t.run();
>>>
>>> }
>>>
>>> public TestFullGCCount(String pool, String collector) {
>>> poolName = pool;
>>> collectorName = collector;
>>> }
>>> public TestFullGCCount() {
>>> }
>>>
>>> public void run() {
>>>
>>> int count = 0;
>>> int iterations = 20;
>>> long counts[] = new long[iterations];
>>> boolean diffAlways2 = true; // assume we will fail
>>>
>>> for (int i=0; i<iterations; i++) {
>>> System.gc();
>>> counts[i] = checkStats();
>>> if (i>0) {
>>> if (counts[i] - counts[i-1] != 2) {
>>> diffAlways2 = false;
>>> }
>>> }
>>> }
>>> if (diffAlways2) {
>>> throw new RuntimeException("FAILED: difference in count is
>>> always 2.");
>>> }
>>> System.out.println("Passed.");
>>> }
>>>
>>> private long checkStats() {
>>> long count = 0;
>>> List<MemoryPoolMXBean> pools =
>>> ManagementFactory.getMemoryPoolMXBeans();
>>> List<GarbageCollectorMXBean> collectors =
>>> ManagementFactory.getGarbageCollectorMXBeans();
>>> for (int i=0; i<collectors.size(); i++) {
>>> GarbageCollectorMXBean collector = collectors.get(i);
>>> String name = collector.getName();
>>> if (name.contains(collectorName)) {
>>> System.out.println(name + ": collection count = "
>>> + collector.getCollectionCount());
>>> count = collector.getCollectionCount();
>>> }
>>> }
>>> return count;
>>>
>>> }
>>>
>>> }
>>>
>>>
>>> On 27/07/11 17:12, Krystal Mok wrote:
>>>> Hi all,
>>>>
>>>> I've been looking at a strange inconsistency of full GC count
>>>> recorded by jvmstat and JMM counters. I'd like to know which ones
>>>> of the following behaviors are by design, which ones are bugs, and
>>>> which ones are just my misunderstanding. I apologize for making a
>>>> short story long...
>>>>
>>>> =====================================================
>>>>
>>>> The counters involved:
>>>>
>>>> * A jvmstat counter named "sun.gc.collector.1.invocations" keeps
>>>> track of the number of pauses occured as a result of a major
>>>> collection. It is used by utilities such as jstat as the source of
>>>> "FGC" (full collection count), and the old gen collection count in
>>>> Visual GC. It's updated by an TraceCollectorStats object.
>>>> * A JMM counter, GCMemoryManager::_num_collections, keeps track of
>>>> the number of collections that have ended. This counter is used as
>>>> HotSpot's implementation of the JMX
>>>> GarbageCollectorMXBean.getCollectionCount(). It's updated by
>>>> either a TraceMemoryManagerStats object or a
>>>> TraceCMSMemoryManagerStats object.
>>>>
>>>> To show the situation, I've made a screenshot of a VisualVM and a
>>>> JConsole running side by side, both are monitoring the VisualVM's
>>>> GC stats:
>>>>
>>>> http://dl.iteye.com/upload/attachment/524811/913cb0e1-7add-3ac0-a718-24ca705cad22.png
>>>>
>>>> (I'll upload the screenshot to somewhere else if anybody can't
>>>> see it)
>>>> The VisualVM instance is running on JDK6u26, with ParNew+CMS.
>>>> In the screenshot, Visual GC reports that the old gen collection
>>>> count is 20, while JConsole reports 10.
>>>>
>>>> I see that there was this bug:
>>>> 6580448: CMS: Full GC collection count mismatch between
>>>> GarbageCollectorMXBean and jvmstat (VisualGC)
>>>> I don't think the current implementation has a bug in the sense
>>>> that the two counters don't report the same number.
>>>>
>>>> This behavior seems reasonable, but the naming of the value in
>>>> these tools are confusing: both tools say "collections", but
>>>> apparently the number in Visual GC means "number of pauses" where
>>>> as the number in JConsole means "number of collection cycles".
>>>> It'd be great if the difference could be documented somewhere, if
>>>> that's the intended behavior.
>>>>
>>>> And then the buggy behavior. Code demo posted on gist:
>>>> https://gist.github.com/1106263
>>>> Starting from JDK6u23, when using CMS without
>>>> ExplicitGCInvokesConcurrent, System.gc() (or
>>>> Runtime.getRuntime().gc(), or MemoryMXBean.gc() via JMX) would
>>>> make the JMM GC counter increment by 2 per invocation, while the
>>>> jvmstat counter is only incremented by 1. I believe the latter is
>>>> correct and the former needs some fixing.
>>>>
>>>> =====================================================
>>>>
>>>> My understanding of the behavior shown above:
>>>>
>>>> 1. The concurrent GC part:
>>>>
>>>> There are 2 pauses in a CMS concurrent GC cycle, one in the
>>>> initial mark phase, and one in the final remark phase.
>>>> To trigger a concurrent GC cycle, the CMS thread wakes up
>>>> periodically to see if it shouldConcurrentCollect(), and trigger a
>>>> cycle when the predicate returned true, or goes back to sleep if
>>>> the predicate returned false. The whole concurrent GC cycle
>>>> doesn't go through GenCollectedHeap::do_collection().
>>>>
>>>> The jvmstat counter for old gen pauses is updated in
>>>> CMSCollector::do_CMS_operation(CMS_op_type op), which exactly
>>>> covers both pause phases.
>>>>
>>>> The JMM counter, however, is updated in the concurrent sweep
>>>> phase, CMSCollector::sweep(bool asynch), if there was no
>>>> concurrent mode failure; or it is updated in
>>>> CMSCollector::do_compaction_work(bool clear_all_soft_refs) in case
>>>> of a bailout due to concurrent mode failure (advertised as so in
>>>> the code comments). So that's an increment by 1 per concurrent GC
>>>> cycle, which does reflect the "number of collection cycles", fair
>>>> enough.
>>>>
>>>> So far so good.
>>>>
>>>> 2. The System.gc() part:
>>>>
>>>> Without ExplicitGCInvokesConcurrent set, System.gc() does a
>>>> stop-the-world full GC, which consists of only one pause, so
>>>> "number of pauses" would equal "number of collections" in this
>>>> case.
>>>> It will go into GenCollectedHeap::do_collection(); both the
>>>> jvmstat and the JMM GC counter gets incremented by 1 here,
>>>>
>>>> TraceCollectorStats tcs(_gens[i]->counters());
>>>> TraceMemoryManagerStats tmms(_gens[i]->kind());
>>>>
>>>> But, drilling down into:
>>>> _gens[i]->collect(full, do_clear_all_soft_refs, size, is_tlab);
>>>>
>>>> That'll eventually go into:
>>>> CMSCollector::acquire_control_and_collect(bool full, bool
>>>> clear_all_soft_refs)
>>>>
>>>> System.gc() is user requested so that'll go further into
>>>> mark-sweep-compact:
>>>> CMSCollector::do_compaction_work(bool clear_all_soft_refs)
>>>> And here, it increments the JMM GC counter again (remember it was
>>>> in the concurrent GC path too, to handle bailouts), even though
>>>> this is still in the same collection. This leads to the "buggy
>>>> behavior" mentioned earlier.
>>>>
>>>> The JMM GC counter wasn't added to CMS until this fix got in:
>>>> 6581734: CMS Old Gen's collection usage is zero after GC which is
>>>> incorrect
>>>>
>>>> The code added to CMSCollector::do_compaction_work() works fine in
>>>> the concurrent GC path, but interacts badly with the original
>>>> logic in GenCollectedHeap::do_collection().
>>>>
>>>> =====================================================
>>>>
>>>> I thought all concurrent mode failures/interrupts come from
>>>> GenCollectedHeap::do_collection(). If that's the case, then it
>>>> seems unnecessary to update the JMM GC counter in
>>>> CMSCollector::do_compaction_work(), simply removing it should fix
>>>> the problem.
>>>>
>>>> With that, I'd purpose the following (XS) change: (diff against
>>>> HS20)
>>>>
>>>> diff -r f0f676c5a2c6
>>>>
>>>> src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
>>>>
>>>> ---
>>>>
>>>> a/src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
>>>>
>>>> Tue Mar 15 19:30:16 2011 -0700
>>>> +++
>>>>
>>>> b/src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
>>>>
>>>> Thu Jul 28 00:02:41 2011 +0800
>>>> @@ -2022,9 +2022,6 @@
>>>>
>>>> _intra_sweep_estimate.padded_average());
>>>> }
>>>>
>>>> - {
>>>> - TraceCMSMemoryManagerStats();
>>>> - }
>>>> GenMarkSweep::invoke_at_safepoint(_cmsGen->level(),
>>>> ref_processor(), clear_all_soft_refs);
>>>> #ifdef ASSERT
>>>>
>>>> The same goes for the changes in:
>>>> 7036199: Adding a notification to the implementation of
>>>> GarbageCollectorMXBeans
>>>>
>>>> =====================================================
>>>>
>>>> P.S. Is there an "official" name for the counters that I referred
>>>> to as "jvmstat counters" above? Is it just jvmstat, or PerfData or
>>>> HSPERFDATA?
>>>
>>>
More information about the hotspot-gc-dev
mailing list