CRR (M): 7127706: G1: re-enable survivors during the initial-mark pause

Fri Jan 20 17:13:39 UTC 2012

Hi all,

New webrev for this based on comments from John:

http://cr.openjdk.java.net/~tonyp/7127706/webrev.3/

Now, the CMRootRegions::claim_next() method does not check the 
ConcurrentMark::has_aborted() flag to know when to abort (and return 
NULL) but, instead, a _has_aborted flag I added to the CMRootRegions 
class. This is set to true at the start of a Full GC and before the Full 
GC waits for the root region scan to finish.

Additionally, I now call _root_regions.reset() from 
CM::checkpointRootsInitialPost()instead of calling it explicitly from 
do_collection_pause_at_safepoint().

Tony

On 01/19/2012 09:40 AM, Tony Printezis wrote:
> Bengt (and all),
>
> Updated webrev using "root regions" now:
>
> http://cr.openjdk.java.net/~tonyp/7127706/webrev.2/
>
> Tony
>
> On 01/19/2012 09:04 AM, Tony Printezis wrote:
>> Bengt,
>>
>> Inline (again!)
>>
>> On 1/19/2012 4:03 AM, Bengt Rutisson wrote:
>>>
>>> Tony,
>>>
>>> On 2012-01-18 22:31, Tony Printezis wrote:
>>>> Bengt,
>>>>
>>>> Here's a webrev with the renaming:
>>>>
>>>> http://cr.openjdk.java.net/~tonyp/7127706/webrev.1/
>>>>
>>>> I have to say I'm not sure I really like the term "initial-mark / 
>>>> IM snapshot regions". I'll try to come up with an alternative name 
>>>> for them....
>>>
>>> Looked quickly at the new webrev.
>>>
>>> I agree that IM-snapshot might not be optimal. Still I like the fact 
>>> that it is not just "snapshot" since I think that can easily be 
>>> confused with the SATB terms. 
>>
>> Well, it's supposed to be. In SATB, anything that's reachable from 
>> the "snapshot" at initial-mark time will be retained. Since we want 
>> to avoid explicitly marking the survivors, we make them all 
>> implicitly live and part of the "snapshot" which is why we have to 
>> scan them (the same way we scan roots during the initial-mark pause). 
>> So, "snapshot regions" is not unreasonable given that if we were not 
>> using SATB we would not be able to do this.
>>
>>> It is of course part of the SATB snapshot, but not the whole thing.
>>>
>>> Just thinking aloud here, what about not using the word "snapshot" 
>>> at all? How about "to_be_scanned_regions", "root_regions" or 
>>> "concurrent_roots"?
>>
>> I like "root regions". Let's go with that (it's shorter too!).
>>
>> Tony
>>
>>> Really not sure what a good name is here...I kind of like the log 
>>> message "Concurrent root scanning took 0.000x ms".
>>>
>>> Bengt
>>>
>>>>
>>>> Tony
>>>>
>>>> Tony Printezis wrote:
>>>>> Hi Bengt,
>>>>>
>>>>> Thanks for looking at this so quickly! Inline.
>>>>>
>>>>> Bengt Rutisson wrote:
>>>>>>
>>>>>> Tony,
>>>>>>
>>>>>> Overall this looks really good. Thanks for fixing it.
>>>>>>
>>>>>> Some comments:
>>>>>>
>>>>>> First, a general question regarding naming and logging. We now 
>>>>>> talk about "snapshot" a lot. It is a pretty good name, but maybe 
>>>>>> it needs some more context to be understandable in the code and 
>>>>>> the GC log. I don't have any really good names, but maybe 
>>>>>> "survivor_snapshot"
>>>>>
>>>>> I'd rather not mention "survivors" given that we might add 
>>>>> non-survivor regions in the future.
>>>>>
>>>>>> or "initial_mark_snapshot"?
>>>>>
>>>>> I like "initial-mark snapshot" better. Having said that 
>>>>> CMInitialMarkSnapshotRegions and _initial_mark_snapshot_regions 
>>>>> are kinda long. :-) I'll abbreviate to CMIMSnapshotRegions and 
>>>>> _im_snapshot_regions if that's OK.
>>>>>
>>>>>> concurrentMark.inline.hpp
>>>>>>
>>>>>>   if (hr == NULL) {
>>>>>>     hr = _g1h->heap_region_containing_raw(addr);
>>>>>>     // Given that we're looking for a region that contains an object
>>>>>>     // header it's impossible to get back a HC region.
>>>>>>     assert(!hr->continuesHumongous(), "sanity");
>>>>>>   } else {
>>>>>>     assert(hr->is_in(addr), "pre-condition");
>>>>>>   }
>>>>>>
>>>>>> The first assert should probably hold even for regions that are 
>>>>>> passed in to grayRoot() right? So, maybe something like:
>>>>>>
>>>>>>   if (hr == NULL) {
>>>>>>     hr = _g1h->heap_region_containing_raw(addr);
>>>>>>   } else {
>>>>>>     assert(hr->is_in(addr), "pre-condition");
>>>>>>   }
>>>>>>   // Given that we need a region that contains an object
>>>>>>   // header it's impossible for it to be a HC region.
>>>>>>   assert(!hr->continuesHumongous(), "sanity");
>>>>>
>>>>> Good observation! I changed to the above.
>>>>>
>>>>>> concurrentMarkThread.cpp
>>>>>>
>>>>>> ConcurrentMarkThread::run()
>>>>>>
>>>>>> Why do we do the explicit time/date stamping?
>>>>>>
>>>>>>           gclog_or_tty->date_stamp(PrintGCDateStamps);
>>>>>>           gclog_or_tty->stamp(PrintGCTimeStamps);
>>>>>>           gclog_or_tty->print_cr("[GC 
>>>>>> concurrent-snapshot-scan-start]");
>>>>>>
>>>>>> why is it not enough with the normal -XX:+PrintGCTimeStamps 
>>>>>> information? 
>>>>>
>>>>> Not quite sure what you mean with "is it not enough with the 
>>>>> normal ... information". Each log record needs either a GC time 
>>>>> stamp or a GC date stamp and we have to print either or both 
>>>>> depending on the two -XX parameters. Unfortunately, the logging 
>>>>> code has not been well abstracted and/or refactored so we have 
>>>>> this unfortunate replication throughout the GCs.
>>>>>
>>>>>> This is probably correct since I see this pattern in other 
>>>>>> places. But I would like to understand why we do it.
>>>>>>
>>>>>>
>>>>>> g1CollectedHeap.cpp:
>>>>>>
>>>>>> G1CollectedHeap::do_collection()
>>>>>>
>>>>>> Is it worth logging how long we had to wait in 
>>>>>> _cm->snapshot_regions()->wait_until_scan_finished(), the same way 
>>>>>> that we do in G1CollectedHeap::do_collection_pause_at_safepoint()?
>>>>>
>>>>> Currently, the GC log records for the evacuation pauses have a lot 
>>>>> of extra information when +PrintGCDetails is set and it was 
>>>>> reasonable to add an extra record with the wait time. And it's 
>>>>> more important to know how the wait for snapshot region scanning 
>>>>> affects evacuation pauses, which are more critical. The Full GC 
>>>>> log records are currently one line and I don't think we want to 
>>>>> extend them further (at least, not before we put a decent GC 
>>>>> logging framework in place). On the other hand, the snapshot 
>>>>> region scanning aborts when a marking cycle is aborted due to a 
>>>>> Full GC. So, this wait time should not be long. How about I add a 
>>>>> comment in the code saying that, when we introduce a more 
>>>>> extensible logging framework, we could add the wait time to the 
>>>>> Full GC log records? Something like:
>>>>>
>>>>>    // Note: When we have a more flexible GC logging framework that
>>>>>    // allows us to add optional attributes to a GC log record we
>>>>>    // could consider timing and reporting how long we wait in the
>>>>>    // following two methods.
>>>>>     wait_while_free_regions_coming();
>>>>>    // ...
>>>>>    _cm->snapshot_regions()->wait_until_scan_finished();
>>>>>
>>>>>
>>>>>> Finally, just some food for thought. Could this be generalized to 
>>>>>> more roots? I mean take a snapshot and scan it concurrently.
>>>>>
>>>>> By scanning the IM snapshot regions we say "these guys are all 
>>>>> roots, instead of scanning them during the GC we will scan them 
>>>>> concurrently". And we can do that for any object / region a) as 
>>>>> long as we know they will not move while they are being scanned 
>>>>> and b) because we have the pre-barrier. If any references on the 
>>>>> snapshot objects are updated, the pre-barrier ensures that their 
>>>>> values at the start of marking will be enqueued and processed.
>>>>>
>>>>> For external arbitrary roots we don't have a write barrier (and we 
>>>>> shouldn't as it'd be too expensive). So, we cannot do that for 
>>>>> non-object roots without a "pre-barrier"-type mechanism.
>>>>>
>>>>> Tony
>>>>>
>>>>>
>>>>>> Bengt
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2012-01-18 00:48, Tony Printezis wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Can I have a couple of code reviews for this change that 
>>>>>>> re-enables the use of survivor regions during the initial-mark 
>>>>>>> pause?
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~tonyp/7127706/webrev.0/
>>>>>>>
>>>>>>> From the CR:
>>>>>>>
>>>>>>> We could scan the survivors as we're copying them, however this 
>>>>>>> will require more work during the initial-mark GCs (and in 
>>>>>>> particular: special-case code in the fast path).
>>>>>>>
>>>>>>> A better approach is to let the concurrent marking threads scan 
>>>>>>> the survivors and mark everything reachable from them a) before 
>>>>>>> any more concurrent marking work is done (so that we can just 
>>>>>>> mark the objects, without needing to push them on a stack, and 
>>>>>>> let the "finger" algorithm discover them) and b) before the next 
>>>>>>> GC starts (since, if we copy them, we won't know which of the 
>>>>>>> new survivors are the ones we need to scan).
>>>>>>>
>>>>>>> This approach has the advantage that it does not require any 
>>>>>>> extra work during the initial-mark GCs and all the work is done 
>>>>>>> by the concurrent marking threads. However, it has the 
>>>>>>> disadvantage that the survivor scanning might hold up the next 
>>>>>>> GC. In most cases this should not be an issue as GCs take place 
>>>>>>> at a reasonably low rate. If it does become a problem we could 
>>>>>>> consider the following:
>>>>>>>
>>>>>>> - like when the GC locker is active, try to extend the eden to 
>>>>>>> give a bit more time to the marking threads to finish scanning 
>>>>>>> the survivors
>>>>>>> - instead of waiting for the marking threads, a GC can take over 
>>>>>>> and finish up scanning the remaining survivors (typically, we 
>>>>>>> have more GC threads than marking threads, so the overhead will 
>>>>>>> be reduced)
>>>>>>> - if we supported region pinning, we could pin all the regions 
>>>>>>> that were not scanned by the time the GC started so that the 
>>>>>>> marking threads can resume scanning them after the GC completes
>>>>>>>
>>>>>>> Implementation notes:
>>>>>>>
>>>>>>> I introduced the concept of a "snapshot regions" in the 
>>>>>>> ConcurrentMark which is a set of regions that need to be scanned 
>>>>>>> at the start of a concurrent cycle. Currently, these can only be 
>>>>>>> survivors but maybe we can use the same concept for something 
>>>>>>> else in the future.
>>>>>>>
>>>>>>> Tony
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>