CRR (L / updated): 6888336: G1: avoid explicitly marking and pushing objects in survivor spaces
Tony Printezis
tony.printezis at oracle.com
Sat Jan 7 04:43:34 UTC 2012
Hi again,
Here's an updated webrev after I merged my changes with John's latest push:
http://cr.openjdk.java.net/~tonyp/6888336/webrev.3/
The code is basically the same, I just had to move some of it to
different places.
Testing update: I've been testing the changes continuously on three
machines over the holidays and I haven't seen any failures since the
single failure over Xmas which was caused by the race in the array
chunking changes which has now been resolved. I did additional testing
with a patch from John (thanks again!) which artificially forces
evacuation failures to stress that code and, again, I saw no issues with
that either.
Tony
On 01/05/2012 03:17 PM, Tony Printezis wrote:
> Hi all,
>
> Updated webrev after making some changes based on comments from John
> (thanks John!):
>
> http://cr.openjdk.java.net/~tonyp/6888336/webrev.2/
>
> I'd like to clarify something: this change relies on the array
> chunking changes (7121623) but the webrev does not include those
> changes (despite what the index page says). So, if you want to try
> this patch out you'll need to apply the array chunking changes first.
>
> Tony
>
> On 12/27/2011 01:05 PM, Tony Printezis wrote:
>> Hi all,
>>
>> Here's an updated webrev for this change that takes into account the
>> new approach of chunking object arrays (see previous e-mails on
>> 7121623):
>>
>> http://cr.openjdk.java.net/~tonyp/6888336/webrev.1/
>>
>> If anything else the new approach simplified the code a bit since now
>> we can always read an object's size from its from-image instead of
>> having to check one or the other depending on whether it's a chunked
>> array or not. I also moved the body of some methods from
>> heapRegion.hpp to the .inline.hpp and .cpp files (as they were
>> getting a bit large to keep in the .hpp file).
>>
>> Tony
>>
>> On 12/21/2011 05:37 PM, Tony Printezis wrote:
>>> Hi all,
>>>
>>> I'd like a couple of code reviews for the following non-trivial
>>> changes (large, not necessary in lines of code modified but more due
>>> to the fact that the evacuation pause / concurrent marking
>>> interaction is changed quite dramatically):
>>>
>>> http://cr.openjdk.java.net/~tonyp/6888336/webrev.0/
>>>
>>> Here's some background, motivation, and a summary of the changes (I
>>> felt that it was important to write a longer then usual explanation):
>>>
>>> * Background / Motivation
>>>
>>> Each G1 heap region has a field top-at-mark-start (aka TAMS) which
>>> denotes where the top of the region was when marking started. An
>>> object is considered implicitly live if it's over TAMS (i.e., it was
>>> allocated since marking started) or explicitly live if it's below
>>> TAMS (i.e., it was allocated before marking started) and marked on
>>> the bitmap. (It follows that it's unnecessary to explicitly mark
>>> objects over TAMS.)
>>>
>>> In fact, we have two copies of the above marking information: "Next
>>> TAMS / Next Bitmap" and "Prev TAMS / Prev Bitmap". Prev is the copy
>>> that was obtained by the last marking cycle that was successfully
>>> completed (so, it is consistent: all live objects should appear as
>>> live in the prev marking information). Next is the copy that will be
>>> obtained / is currently being obtained and it's not consistent
>>> because it's not guaranteed to be complete.
>>>
>>> G1 uses SATB marking which has the advantage not to require objects
>>> allocated since the start of marking to be visited at all by the
>>> marking threads (they are implicitly live and they do not need to be
>>> scanned). So, the active marking cycle can totally ignore objects
>>> over NTAMS (since they have been allocated since marking started).
>>>
>>> The current interaction between evacuation pauses (let's call these
>>> "GCs" from now on) and concurrent marking is very tricky. Even
>>> though marking ignores all objects over NTAMS (currently: all
>>> objects in Eden regions) it still has to visit and mark objects in
>>> the Survivors regions. But those will be moved by subsequent GCs.
>>> So, a GC needs to be aware that it's moving objects that have been
>>> marked by the marking threads and not only propagate those marks but
>>> also notify the marking threads that said objects have been moved.
>>> For that we use several data structures: pushes to the global
>>> marking stack and also to what's referred to as the "region stack"
>>> which is only used by the GC to push a group of objects instead of
>>> pushing them individually ("region" here is a mem region and
>>> smaller than a G1 region).
>>>
>>> Additionally, because the marking threads could come across objects
>>> that could potentially move we have to make sure that we don't leave
>>> references to regions that have been evacuated on any marking data
>>> structure. To do that we treat as roots all entries on the
>>> taskqueues / global stack and drained all SATB buffers (both active
>>> buffers and also enqueued buffers).
>>>
>>> The first issue with the above interaction is that it has
>>> performance issues. Draining all SATB buffers and scanning the mark
>>> stack and taskqueues has been shown to be very time-consuming in
>>> some cases. Also, having to check whether objects are marked and
>>> propagate the marks appropriately during GC is an extra overhead.
>>>
>>> The second issue is that it has been shown to be very fragile. We
>>> have discovered and fixed many issues over time which were subtle
>>> and hard to reproduce.
>>>
>>> We really need to simplify the GC/marking interaction to both
>>> improve performance of GCs during marking, as well as improve our
>>> reliability. This changeset does exactly that.
>>>
>>> * Explanation of the changes
>>>
>>> The goal is to ensure that all the objects that are copied by the GC
>>> do not need to be visited by the marking threads and as a result do
>>> not need to be explicitly marked, pushed, etc.
>>>
>>> The first observation is that most objects copied during a GC are
>>> allocated after marking starts and are therefore implicitly live.
>>> This is the case for all objects on Eden regions, as well as most
>>> objects on Survivor regions. The only exception are objects on the
>>> Survivor regions during the initial-mark pause. Unfortunately, it's
>>> not easy to track those separately as they will get mixed in with
>>> future Survivors. The first decision to deal with this is to turn
>>> off Survivors during the initial-mark pause. This ensures that all
>>> objects copied during each subsequent GC will only visit objects
>>> that have been allocated since marking started and are therefore
>>> implicitly live (i.e., over NTAMS). This allows us to totally
>>> eliminate that code that propagates marks during the GC. We just
>>> have to make sure that all copied objects are over NTAMS. Turning
>>> off Survivors during an initial-mark pause is a bit of a "big
>>> hammer" approach, but it will suffice for now. We have ideas on how
>>> to re-enable them in the future and we'll explore a couple of
>>> alternatives.
>>>
>>> Given that the GC only copies objects that are implicitly marked it
>>> follows that none of the objects that are copied during any GC
>>> should appear on either the taskqueues nor the global marking stack.
>>> Also remember that we filter SATB buffers before enqueueing them
>>> which will filter out all implicitly marked objects. It follows that
>>> no enqueued SATB buffer should have references to objects that are
>>> being moved. This leaves the currently active SATB buffers given
>>> that the code that populates them is unconditional. But if we run
>>> the filtering on those during each GC such "offending" references
>>> are also quickly eliminated. So, instead of having to scan all
>>> stacks and all SATB buffers we only have to filter the active SATB
>>> buffers, which should be much, much faster.
>>>
>>> * Implementation Notes
>>>
>>> The actual changes are not too extensive as they basically mostly
>>> disable functionality in the GC code. The tricky part was to get the
>>> TAMS fields correct at various phases (start of copying, start of
>>> marking, etc.) and especially when an evacuation failure occurs. I
>>> put all that functionality in methods on HeapRegion which do the
>>> right thing when a GC starts, a marking starts, etc.
>>>
>>> The most important changes are in the "main" GC code, i.e.
>>> G1ParCopyHelper::do_oop_work() and
>>> G1ParCopyHelper::copy_to_survivor_space(). Instead of having to
>>> propagate marks we only now need to mark objects directly reachable
>>> from roots during the initial-mark pause. The resulting code is much
>>> simplified (and hopefully more performant!).
>>>
>>> I also added a method verify_no_cset_oops() which checks that indeed
>>> all the marking data structures do not point to regions that are
>>> being GCed at the start / end of each GC. (BTW, I'm considering
>>> adding a develop flag to enable this on demand.)
>>>
>>> I should point out that this changeset will leave a lot of dead
>>> code. However, I took the decision to keep the changes to a minimum
>>> in order not overwhelm the code reviewers and make the important
>>> changes clearer. (I also discussed this with a couple of potential
>>> code reviewers and they agreed that this is a good approach.) I
>>> temporarily added guarantees to ensure that methods that should not
>>> be called are not called. I will remove all dead code with a future
>>> push.
>>>
>>> I also have to apologize to John Cuthbertson for removing a lot of
>>> code he's added to deal with various bugs we had in the GC/marking
>>> interaction. Hopefully the new code will be less fragile compared to
>>> what we've had so far and John will be able to concentrate on more
>>> interesting features than trying to track down hard-to-reproduce
>>> failures!
>>>
>>> Tony
>>>
More information about the hotspot-gc-dev
mailing list