Question about reference processing in JDK6/7
Mikael Gerdin
mikael.gerdin at oracle.com
Tue Feb 25 16:10:30 UTC 2014
Hi Andrew,
On Tuesday 25 February 2014 15.04.34 Andrew Dinn wrote:
> I have identified a change in the way references are processed in the
> default (parallel scavenge) GC, which is having an effect on legacy
> applications running on JDK6 and appears also to affect the jdk7u tree,
> albeit with less severe impact.
>
> I think I have pinned down the change to a specific change set in the
> hotspot tree and, indeed, to a specific edit within that change set.
> However, since the edit does not appear to relate directly to the
> documented purpose of the change set I would be grateful if someone on
> the gc dev list could help me identify what the point of the edit was
> and comment on whether the consequences for reference processing were
> intended or unintended.
>
> Full details of the problem and my diagnosis are included below. Thanks
> for any light anyone in the GC dev team can shed.
>
> regards,
>
>
> Andrew Dinn
> -----------
>
> Behavioural Manifestation of Problem
>
> The change in reference processing behaviour was noticed between two Red
> Hat releases of OpenJDK based, respectively, on tags jdk6-b24 and
> jdk6-b28. The same problem has been reported in several customer
> deployments, all of which make use of different types of Reference
> instances, whether FinalReference, PhantomReference or whatever.
>
> Applications which ran in a large heap (1-2 GB) with a relatively low
> working set (200 MB) on the jdk6-b24 based JVM began to experience
> out-of-memory exceptions on the jdk6-b28 based JVM. Heap dump analysis
> indicated in each case that the heap memory included a very large number
> of Reference instances. These references and their referents accounted
> for the majority of the occupied heap. Some of the references had
> already been discovered by the GC and some of them were still
> undiscovered. The vast majority were not active. In all cases the
> ReferenceProcessor and Finalizer threads were sitting waiting on empty
> queues at OOM.
>
> In order to reproduce this behaviour I developed a small program
> (attached - run as "java -XX:+PrintGCDetails FinalizeTest") which
> retains a large, bounded set of references to finalizable objects. It
> turns over the retained set at a fairly high rate, using the finalizer
> method to count and occasionally display the number of references
> actually finalized (it counts in blocks of 2^29). This program can be
> used to measure the rate at which the GC can keep up with dropped
> Reference instances.
>
> On the jdk6-b24 based JVM this test finalizes around 500 blocks before
> the heap fills up. On the jdk6-b28 based JVM it manages at best 1 or 2
> blocks. I also ran the same test on the latest jdk7u. It manages around
> 500 blocks before the heap fills up, but it seems to run more Full GCs
> and takes longer to reach this limit than the jdk6-b24 based JVM.
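(For anyone following along without the attachment: a test along these
lines can be sketched in a few lines of Java. This is my own
reconstruction from the description above, not the attached
FinalizeTest; the class name and retained-set size are guesses, and the
block size of 2^29 is taken from the text.)

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a finalization stress test: retain a bounded set of
// finalizable objects and turn it over at a high rate. Each overwritten
// slot drops an object for the collector to discover and finalize.
public class FinalizeStress {
    static final int RETAINED = 100000;   // size of the retained set (a guess)
    static final long BLOCK = 1L << 29;   // count finalizations in blocks of 2^29
    static final AtomicLong finalized = new AtomicLong();

    static class Payload {
        protected void finalize() {
            long n = finalized.incrementAndGet();
            if (n % BLOCK == 0) {
                System.out.println("finalized blocks: " + (n / BLOCK));
            }
        }
    }

    public static void main(String[] args) {
        Payload[] retained = new Payload[RETAINED];
        // Run until OOM, overwriting one slot per iteration so dropped
        // objects pile up for the reference processor and finalizer.
        for (long i = 0; ; i++) {
            retained[(int) (i % RETAINED)] = new Payload();
        }
    }
}

Run with a large heap (for example "java -XX:+PrintGCDetails -Xmx2g
FinalizeStress") and watch how quickly the block counter advances
relative to GC activity.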
>
> The Culprit Change Set
>
> The critical change which appears to make the jdk6-b28 based JVM fail is
> part of the following change set
>
> Revision: 3153
> Branch: default
> Author: stefank 2011-09-01 15:18:17
> Committer: stefank 2011-09-01 15:18:17
> Parent: 3151:27702f012017 (7087583: Hotspot fails to allocate heap with
> mmap(MAP_HUGETLB))
> Child: 3154:05550041d664 (Merge)
>
> 7085906: Replace the permgen allocated sentinelRef with a
> self-looped end
> Summary: Remove the sentinelRef and let the last Reference in a
> discovered chain point back to itself.
> Reviewed-by: ysr, jmasa
>
> In particular, the change set includes the following changes to method
> PSMarkSweep::mark_sweep_phase1() in file
> hotspot/src/share/vm/gc_implementation/parallelScavenge/psMarkSweep.cpp
>
> @@ -516,7 +516,6 @@
>    {
>      ParallelScavengeHeap::ParStrongRootsScope psrs;
>      Universe::oops_do(mark_and_push_closure());
> -    ReferenceProcessor::oops_do(mark_and_push_closure());
>      JNIHandles::oops_do(mark_and_push_closure());   // Global (strong) JNI handles
>      CodeBlobToOopClosure each_active_code_blob(mark_and_push_closure(), /*do_marking=*/ true);
>      Threads::oops_do(mark_and_push_closure(), &each_active_code_blob);
> @@ -623,7 +622,6 @@
>
>    // General strong roots.
>    Universe::oops_do(adjust_root_pointer_closure());
> -  ReferenceProcessor::oops_do(adjust_root_pointer_closure());
>    JNIHandles::oops_do(adjust_root_pointer_closure());  // Global (strong) JNI handles
>    Threads::oops_do(adjust_root_pointer_closure(), NULL);
>    ObjectSynchronizer::oops_do(adjust_root_pointer_closure());
>
> This change has the consequence that references discovered during young
> generation pauses do not get forwarded to the reference processor queue
> and hence, in the case of FinalReference instances, that the related
> referents do not get processed.
>
> So, my questions are:
>
> 1) Why was this change made as part of this change set?
>
> -- was it necessary or just an extra bonus change thrown in for an
> independent reason?
>
> -- was it an accident?
This change removed the "sentinel" object used to keep track of the end of the
discovered lists. That object was treated specially and was allocated in perm
gen. In OpenJDK7 we were removing all "normal" Java objects from perm gen and
keeping only metadata objects there.
Moving the "sentinel" out of perm gen would be impractical since the running
garbage collection could then move the object while it's being used by the
collector.
ReferenceProcessor::oops_do was only needed to keep the sentinel object alive
and adjust the static pointer if a perm gen compaction was performed.
AFAIK this change was not intended to change the rate at which references were
processed, only to provide a new way of keeping track of the discovered lists
without using a "sentinel" object.
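To make the shape of the change concrete, here is a heavily simplified
illustration of the two list-termination schemes. It is in Java for
brevity (the real code is C++ in the ReferenceProcessor), and the names
are mine, not HotSpot's:

// Simplified model of a discovered-reference list; only the
// list-termination logic is shown.
class Ref {
    Ref discovered;  // next reference in the discovered list
}

class DiscoveredList {
    Ref head;

    // Old scheme: a statically allocated sentinel marks the end of the
    // list. Because the sentinel lived in perm gen, the collector had
    // to visit it as a root (ReferenceProcessor::oops_do) to keep it
    // alive and to update its address on perm gen compaction.
    static final Ref SENTINEL = new Ref();

    void addOldScheme(Ref r) {
        r.discovered = (head != null) ? head : SENTINEL;
        head = r;
    }

    boolean isLastOldScheme(Ref r) {
        return r.discovered == SENTINEL;
    }

    // New scheme (7085906): the last reference points back to itself,
    // so no shared sentinel object, and no root scanning for it, is
    // needed at all.
    void addNewScheme(Ref r) {
        r.discovered = (head != null) ? head : r;  // self-loop terminates
        head = r;
    }

    boolean isLastNewScheme(Ref r) {
        return r.discovered == r;
    }
}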
>
> 2) Was there meant to be some compensating change to ensure that
> reference processing was not delayed?
>
> -- in particular, has this compensating change been applied in the jdk7u
> tree?
>
> -- if so does anyone know what the relevant change is and (for bonus
> points) whether we can pull it into our next OpenJDK6 release?
The test seems to work on the GA build of (Oracle) JDK7 and some older
(Oracle) JDK6 releases, so there may be some issue local to the unique
combination of JVM and JDK changes present in OpenJDK6.
>
> 3) Does this change reflect a "don't care" attitude to programs which
> fail with OOM because they generate references fast enough that the
> modified GC cannot catch up with the mutators?
>
> -- or was it just an accidental side-effect of the change I have identified?
>
> -- or maybe just an unavoidable side-effect which has to be lived with?
I'd say that this is a bug that should be fixed, but I don't have the cycles to
dig into what caused it.
/Mikael