RFR (M): 6672778: G1 should trim task queues more aggressively during evacuation pauses

Stefan Johansson stefan.johansson at oracle.com
Thu Apr 12 15:15:01 UTC 2018


Hi Thomas,

On 2018-04-11 13:46, Thomas Schatzl wrote:
> Hi all,
> 
>    I updated and (hopefully) improved the change a bit after some more
> thinking.
> 
> Particularly I thought that tracking the partial trim time in the
> closures would be confusing and too intrusive, so I moved it out from
> them to the G1ParScanThreadState.
> This also removed quite a few otherwise necessary includes to
> utilities/ticks.hpp.
> 
> Also the time accounting has been a bit messy as you needed to
> add/subtract the trim time in various places that were non-obvious. I
> tried to improve that by avoiding (existing) nested timing (remembered
> set cards/remembered set code roots and updateRS/scanHCC), which made
> the code imho much easier to follow.
> 
> These two changes also make the follow-up "JDK-8201313: Sub-phases of
> ext root scan may be larger than the sum of individual timings"
> somewhat simpler.
> 
> Note that to gather the timings the code uses Tickspan for holding
> intermediate values, i.e. basically nanoseconds. Unfortunately G1
> logging uses seconds encoded as doubles everywhere for historical
> reasons; there is a really huge change fixing this coming, but for now
> more and more places are going to use Tickspan.
> 
> Webrevs:
> http://cr.openjdk.java.net/~tschatzl/6672778/webrev.0_to_1/ (diff)
> http://cr.openjdk.java.net/~tschatzl/6672778/webrev.1/ (full)

Thanks for this update. I had a glance at the first version, and I think 
this is indeed simpler. I still have some comments though:

src/hotspot/share/gc/g1/g1CollectedHeap.cpp
3129 root_processor->evacuate_roots(pss, pss->closures(), worker_id);

Only pass in pss, and get the closures inside evacuate_roots.
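Something along these lines is what I had in mind -- just a sketch, and
the exact closure type returned by pss->closures() is my assumption, not
necessarily what the patch ends up with:

  // Call site in g1CollectedHeap.cpp:
  root_processor->evacuate_roots(pss, worker_id);

  // G1RootProcessor fetches the closures from the per-thread state itself:
  void G1RootProcessor::evacuate_roots(G1ParScanThreadState* pss, uint worker_id) {
    G1EvacuationRootClosures* closures = pss->closures();
    // ... rest of the method unchanged, using 'closures' ...
  }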
-----
src/hotspot/share/gc/g1/g1SharedClosures.hpp
48 double trim_time_sec() {
49   return TicksToTimeHelper::seconds(_oops.trim_ticks_and_reset()) +
50     TicksToTimeHelper::seconds(_oops_in_cld.trim_ticks_and_reset());
51 }

This could be removed now, right?
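As far as I can tell the partial trim time now gets accumulated in the
pss itself using Tickspan, roughly along these lines (the member and
accessor names here are just my shorthand, not taken from the webrev):

  // In G1ParScanThreadState (sketch):
  Tickspan _trim_ticks;                // accumulated partial trim time

  void add_trim_ticks(Tickspan ticks) { _trim_ticks += ticks; }

  Tickspan trim_ticks_and_reset() {
    Tickspan result = _trim_ticks;
    _trim_ticks = Tickspan();          // reset so nothing is counted twice
    return result;
  }

so the per-closure accounting in G1SharedClosures should not be needed
anymore.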
-----
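One more thing, mostly to check that I read the trimming hysteresis
right: a thread starts trimming once its queue grows past the start
threshold and then keeps draining entries until the queue is back below
the lower stop threshold, timing the whole episode once. Roughly (the
helper names and exact thresholds below are only for illustration, not
from the patch):

  void trim_queue_partially(G1ParScanThreadState* pss) {
    const size_t start_threshold = 2 * GCDrainStackTargetSize; // begin trimming above this
    const size_t stop_threshold  = GCDrainStackTargetSize;     // ...and drain down to this
    if (pss->queue_size() <= start_threshold) {
      return;                                  // queue still small, nothing to do
    }
    const Ticks start = Ticks::now();
    do {
      pss->process_one_task();                 // pop and evacuate one reference
    } while (pss->queue_size() > stop_threshold);
    // One timing measurement covers many queue operations.
    pss->add_trim_ticks(Ticks::now() - start);
  }

If that matches your intent, I agree the timing overhead should be nicely
amortized.
-----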

Apart from these comments, Thomas and I have also spoken a bit offline, 
so there might be some additional changes.

Thanks,
Stefan


> Testing:
> hs-tier 1-3
> 
> Thanks,
>    Thomas
> 
> On Mon, 2018-04-09 at 13:20 +0200, Thomas Schatzl wrote:
>> Hi all,
>>
>>    I am happy to finally bring this, one of the oldest G1 issues we
>> have, to a happy ending :)
>>
>> So until now G1 buffered all oop locations it encountered during root
>> scanning (including from remembered sets and refinement queues) in
>> the
>> per-thread work queues, and only drained them at the very end of the
>> evacuation pause.
>>
>> I am not completely sure why this has been implemented this way, but
>> it
>> has serious drawbacks:
>> - the work queues and overflow stacks may use a lot of memory, and I
>> mean *a lot*
>> - since we buffer all oop references, the prefetching G1 does during
>> root scan goes to waste, as G1 always prefetches without immediately
>> following up on it, wasting memory bandwidth.
>>
>> Other GCs already employ this technique, so my best guess why G1 has
>> not done so until now is that G1 needs sub-timings for the various
>> phases to get the prediction right, and even if doing the timing is
>> cheap, doing it too often just adds up.
>>
>> Anyway, this problem has been solved by implementing a hysteresis,
>> i.e.
>> start trimming the work queues at a higher threshold than the one at
>> which trimming stops, and time the whole trimming episode in between.
>> So the timing
>> measurement overhead gets distributed across many work queue
>> operations.
>> Note that I did not do much testing regarding the optimal hysteresis
>> range;
>> the suggested guess of 2xGCDrainStackTargetSize seems to be a pretty
>> good value (i.e. reduces overhead well enough).
>>
>> Results are pretty good: I have seen reductions of the maximum task
>> queue size by multiple orders of magnitude (i.e. less memory usage
>> during GC), and reduction of total pause time by up to 50%,
>> particularly on larger applications in the few GB heap range where
>> quite a bit of data is copied around every GC.
>>
>> But also smaller applications and applications doing less copying
>> benefit quite a bit, reducing pause times significantly.
>>
>> Note that there is a known, actually pre-existing bug with buffering
>> up
>> references (by the now obsolete and removed BufferingOopClosure): the
>> sum of timings for the sub-phases of ext root scan may be larger than
>> the printed total. This will be addressed with the follow-up JDK-
>> 8201313 to keep this change small, and it's a pre-existing issue
>> anyway
>> :P
>>
>> CR:
>> https://bugs.openjdk.java.net/browse/JDK-6672778
>> Webrev:
>> http://cr.openjdk.java.net/~tschatzl/6672778/webrev/
>> Testing:
>> hs-tier 1-3, perf tests
>>
>> Thanks,
>>    Thomas
>>
> 


