CRR (L): 6888336: G1: avoid explicitly marking and pushing objects in survivor spaces

Wed Dec 21 22:37:44 UTC 2011

Hi all,

I'd like a couple of code reviews for the following non-trivial changes
(large, not necessarily in lines of code modified, but more due to the
fact that the evacuation pause / concurrent marking interaction is
changed quite dramatically):

http://cr.openjdk.java.net/~tonyp/6888336/webrev.0/

Here's some background, motivation, and a summary of the changes (I felt
that it was important to write a longer than usual explanation):

* Background / Motivation

Each G1 heap region has a field top-at-mark-start (aka TAMS) which 
denotes where the top of the region was when marking started. An object 
is considered implicitly live if it's over TAMS (i.e., it was allocated 
since marking started) or explicitly live if it's below TAMS (i.e., it 
was allocated before marking started) and marked on the bitmap. (It 
follows that it's unnecessary to explicitly mark objects over TAMS.)
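
To make the liveness rule above concrete, here is a minimal sketch of it.
It is not HotSpot code: the Region struct, its word-index addresses, and
all member names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of one heap region. Addresses are word
// indices in [0, capacity) and the bitmap has one bit per word.
struct Region {
    std::size_t top = 0;       // current allocation top
    std::size_t tams = 0;      // top-at-mark-start
    std::vector<bool> bitmap;  // explicit marks, only meaningful below TAMS

    explicit Region(std::size_t capacity) : bitmap(capacity, false) {}

    // Record where top was when marking started.
    void note_start_of_marking() { tams = top; }

    // Bump-pointer allocation; returns the index of the new object.
    std::size_t allocate(std::size_t words) {
        assert(top + words <= bitmap.size());
        std::size_t obj = top;
        top += words;
        return obj;
    }

    // Explicitly mark an object on the bitmap.
    void mark(std::size_t obj) { bitmap[obj] = true; }

    // Implicitly live at or above TAMS, explicitly live if marked below it.
    bool is_live(std::size_t obj) const {
        return obj >= tams || bitmap[obj];
    }
};
```

In this model an object allocated after note_start_of_marking() is live
without ever touching the bitmap, while an object below TAMS is live
only once mark() has been called on it.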

In fact, we have two copies of the above marking information: "Next TAMS 
/ Next Bitmap" and "Prev TAMS / Prev Bitmap". Prev is the copy that was 
obtained by the last marking cycle that was successfully completed (so, 
it is consistent: all live objects should appear as live in the prev 
marking information). Next is the copy that will be obtained / is 
currently being obtained and it's not consistent because it's not 
guaranteed to be complete.

G1 uses SATB marking, which has the advantage of not requiring objects
allocated since the start of marking to be visited at all by the marking
threads (they are implicitly live and they do not need to be scanned).
So, the active marking cycle can totally ignore objects over NTAMS 
(since they have been allocated since marking started).

The current interaction between evacuation pauses (let's call these 
"GCs" from now on) and concurrent marking is very tricky. Even though 
marking ignores all objects over NTAMS (currently: all objects in Eden 
regions) it still has to visit and mark objects in the Survivor
regions. But those will be moved by subsequent GCs. So, a GC needs to be 
aware that it's moving objects that have been marked by the marking 
threads and not only propagate those marks but also notify the marking 
threads that said objects have been moved. For that we use several data 
structures: pushes to the global marking stack and also to what's 
referred to as the "region stack" which is only used by the GC to push a 
group of objects instead of pushing them individually ("region" here is
a mem region and smaller than a G1 region).

Additionally, because the marking threads could come across objects that 
could potentially move we have to make sure that we don't leave 
references to regions that have been evacuated on any marking data 
structure. To do that we treat as roots all entries on the taskqueues /
global stack and drain all SATB buffers (both active buffers and also
enqueued buffers).

The first issue with the above interaction is performance. Draining
all SATB buffers and scanning the mark stack and
taskqueues has been shown to be very time-consuming in some cases. Also, 
having to check whether objects are marked and propagate the marks 
appropriately during GC is an extra overhead.

The second issue is that it has been shown to be very fragile. We have 
discovered and fixed many issues over time which were subtle and hard to 
reproduce.

We really need to simplify the GC/marking interaction to both improve 
performance of GCs during marking, as well as improve our reliability. 
This changeset does exactly that.

* Explanation of the changes

The goal is to ensure that all the objects that are copied by the GC do 
not need to be visited by the marking threads and as a result do not 
need to be explicitly marked, pushed, etc.

The first observation is that most objects copied during a GC are 
allocated after marking starts and are therefore implicitly live. This 
is the case for all objects on Eden regions, as well as most objects on 
Survivor regions. The only exceptions are objects on the Survivor regions
during the initial-mark pause. Unfortunately, it's not easy to track 
those separately as they will get mixed in with future Survivors. The 
first decision to deal with this is to turn off Survivors during the 
initial-mark pause. This ensures that each subsequent GC will only copy
objects that have been allocated since marking started and are
therefore implicitly live (i.e., over NTAMS). This allows us to totally
eliminate the code that propagates marks
during the GC. We just have to make sure that all copied objects are 
over NTAMS. Turning off Survivors during an initial-mark pause is a bit 
of a "big hammer" approach, but it will suffice for now. We have ideas 
on how to re-enable them in the future and we'll explore a couple of 
alternatives.
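
The invariant that all copied objects are over NTAMS can be sketched as
follows. This is only an illustration under assumptions: GcAllocRegion
and its members are hypothetical names, and the sketch simply assumes
that a region receiving copies has its NTAMS at bottom. It ignores
evacuation failure and is not how the webrev implements it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a region that a GC copies objects into.
// Addresses are plain word indices.
struct GcAllocRegion {
    std::size_t bottom;
    std::size_t top;
    std::size_t end;
    std::size_t ntams;  // next top-at-mark-start

    // Assumption: NTAMS starts at bottom, so every word of the region
    // counts as allocated since marking started.
    GcAllocRegion(std::size_t bottom_, std::size_t words)
        : bottom(bottom_), top(bottom_), end(bottom_ + words),
          ntams(bottom_) {}

    // Copy an object of the given size into the region and return its
    // new address. No bitmap update and no push to a marking stack.
    std::size_t copy_object(std::size_t words) {
        assert(top + words <= end);
        std::size_t new_obj = top;
        top += words;
        assert(new_obj >= ntams);  // the invariant: copies are over NTAMS
        return new_obj;
    }

    bool implicitly_live(std::size_t obj) const { return obj >= ntams; }
};
```

Because every copy lands at or above NTAMS in this model, the copying
path has nothing to mark and nothing to report to the marking threads.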

Given that the GC only copies objects that are implicitly marked it 
follows that none of the objects that are copied during any GC should 
appear on either the taskqueues or the global marking stack. Also
remember that we filter SATB buffers before enqueueing them which will 
filter out all implicitly marked objects. It follows that no enqueued 
SATB buffer should have references to objects that are being moved. This 
leaves the currently active SATB buffers given that the code that 
populates them is unconditional. But if we run the filtering on those 
during each GC such "offending" references are also quickly eliminated. 
So, instead of having to scan all stacks and all SATB buffers we only 
have to filter the active SATB buffers, which should be much, much faster.
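
As a rough sketch of the filtering step described above, the following
drops every implicitly marked entry from one buffer. The names
SatbBuffer and filter_satb_buffer are hypothetical, and the model
assumes word-index addresses that all belong to a single region with
one NTAMS value. The real filter works per region and may apply
further checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: an SATB buffer is a list of object addresses.
using SatbBuffer = std::vector<std::size_t>;

// Remove every entry that is implicitly live (at or above NTAMS).
// Returns the number of entries removed.
std::size_t filter_satb_buffer(SatbBuffer& buf, std::size_t ntams) {
    auto over_ntams = [ntams](std::size_t obj) { return obj >= ntams; };
    auto first_removed = std::remove_if(buf.begin(), buf.end(), over_ntams);
    std::size_t removed =
        static_cast<std::size_t>(buf.end() - first_removed);
    buf.erase(first_removed, buf.end());
    // Post-condition: no remaining entry refers to an object over NTAMS.
    assert(std::none_of(buf.begin(), buf.end(), over_ntams));
    return removed;
}
```

Since a GC only moves objects over NTAMS, a buffer that has been through
this filter cannot hold a reference to an object that is being moved.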

* Implementation Notes

The actual changes are not too extensive as they mostly
disable functionality in the GC code. The tricky part was to get the 
TAMS fields correct at various phases (start of copying, start of 
marking, etc.) and especially when an evacuation failure occurs. I put 
all that functionality in methods on HeapRegion which do the right thing 
when a GC starts, a marking starts, etc.

The most important changes are in the "main" GC code, i.e. 
G1ParCopyHelper::do_oop_work() and 
G1ParCopyHelper::copy_to_survivor_space(). Instead of having to 
propagate marks we now only need to mark objects directly reachable from
roots during the initial-mark pause. The resulting code is much 
simplified (and hopefully more performant!).

I also added a method verify_no_cset_oops() which checks that indeed all 
the marking data structures do not point to regions that are being GCed 
at the start / end of each GC. (BTW, I'm considering adding a develop 
flag to enable this on demand.)
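
The idea behind such a check can be sketched like this. It is not the
code in the webrev: MarkingState, kRegionWords, and the helper names are
assumptions for illustration, with the collection set (cset) modelled
as a set of region indices and addresses as word indices.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical fixed region size, in words.
constexpr std::size_t kRegionWords = 8;

using Addr = std::size_t;
using AddrList = std::vector<Addr>;
using CSet = std::unordered_set<std::size_t>;  // region indices

// Hypothetical stand-in for the marking data structures.
struct MarkingState {
    AddrList global_stack;
    std::vector<AddrList> task_queues;
    std::vector<AddrList> satb_buffers;  // active and enqueued buffers
};

bool in_cset(Addr obj, const CSet& cset) {
    return cset.count(obj / kRegionWords) != 0;
}

bool list_is_clean(const AddrList& list, const CSet& cset) {
    for (Addr obj : list) {
        if (in_cset(obj, cset)) return false;
    }
    return true;
}

// True if no marking data structure refers to an object in a cset region.
bool no_cset_oops(const MarkingState& state, const CSet& cset) {
    if (!list_is_clean(state.global_stack, cset)) return false;
    for (const AddrList& q : state.task_queues) {
        if (!list_is_clean(q, cset)) return false;
    }
    for (const AddrList& b : state.satb_buffers) {
        if (!list_is_clean(b, cset)) return false;
    }
    return true;
}

// Verification wrapper, meant to run at the start / end of each GC.
void verify_no_cset_oops_sketch(const MarkingState& state, const CSet& cset) {
    assert(no_cset_oops(state, cset));
}
```

Keeping the boolean check separate from the asserting wrapper makes it
easy to run the same walk from a test or behind a flag.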

I should point out that this changeset will leave a lot of dead code. 
However, I took the decision to keep the changes to a minimum in order
not to overwhelm the code reviewers and to make the important changes clearer.
(I also discussed this with a couple of potential code reviewers and 
they agreed that this is a good approach.) I temporarily added 
guarantees to ensure that methods that should not be called are not 
called. I will remove all dead code in a future push.

I also have to apologize to John Cuthbertson for removing a lot of code 
he's added to deal with various bugs we had in the GC/marking 
interaction. Hopefully the new code will be less fragile compared to 
what we've had so far and John will be able to concentrate on more 
interesting features than trying to track down hard-to-reproduce failures!

Tony