RFR: 8296419: [REDO] JDK-8295319: pending_cards_at_gc_start doesn't include cards in thread buffers

Tue Nov 15 14:24:16 UTC 2022

On Wed, 9 Nov 2022 08:59:55 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Let's try this again.  This is the original JDK-8295319 change, plus a couple
>> of small additions to fix problems with that original change.
>> 
>> The description of the original change was:
>> 
>> -----
>> 
>> Please review this change to G1 to include the per-thread buffers in the
>> number of pending cards at the start of a young GC.
>> 
>> DCQS::concatenate_logs has been renamed to concatenate_logs_and_stats, and now
>> also merges the per-thread refinement stats during the thread walk to flush
>> buffers. That replaces the separate thread walk to merge and record these
>> stats earlier in the GC. The merged stats and related info don't seem to be
>> needed until after the buffer flushing.
>> 
>> Also, when abandoning dirty card buffers and stats because of a full GC, fixed
>> to also abandon any buffers in the paused buffers lists.
>> 
>> -----
>> 
>> The problem was that by moving log concatenation earlier, it no longer
>> followed the call to retire_tlabs().  Among other things, that function
>> calls flush_deferred_card_mark_barrier() on each thread, possibly adding cards
>> to the thread's dirty card queue.  Doing that after the log concatenation may
>> lead to unprocessed cards in thread queues, with chaos ensuing.
>> 
>> To fix this, the call to retire_tlabs has also been moved, so that it once
>> again precedes the log concatenation.  Also added (debug-only) verification
>> that all thread dirty card queues are empty at the end of
>> pre_evacuate_collection_set.
>> 
>> A possible followup is to refactor those two operations, which each do a
>> (single-threaded) walk of all threads.  We could combine that into a single
>> walk.  We could also parallelize it if that seems warranted, though the
>> per-thread work is usually pretty small, so might not be worth parallelizing.
>> 
>> There might be an opportunity to do some similar refactoring for fullgc and
>> log abandonment, though there the two operations are done far away from each
>> other.
>> 
>> Testing:
>> mach5 tier1-6
>
> src/hotspot/share/gc/g1/g1YoungCollector.cpp line 1094:
> 
>> 1092:     // Flushes deferred card marks, so must precede concatenting logs.
>> 1093:     retire_tlabs();
>> 1094: 
> 
> I was wondering whether we should do something about all the methods (`retire_tlabs`, `concatenate_dirty_card_logs...`, `calculate_collection_set`) methods here that are part of the `pre_evacuate_collection_set` phase but are now located outside of that method.
> 
> Looking at the `per_thread_states` parameter passed to `pre_evacuate_collection_set`, it's only used for some verification inside `pre_evacuate_collection_set`, i.e. we could move that verification and initialization of the PSSS :) after`pre_evacuate_collection_set` instead.
> 
> Since the change touches that code, I think it is appropriate (and better) to consolidate right away.
> 
> I agree about that deferring combining and parallelizing the various phases of `pre_evacuate_collection_set` can be done extra; the reason why we want parallelization of the TLAB retiring is that there can be tens of thousands threads, and even little work per thread adds up (there is a CR for that already actually, [JDK-8211104](https://bugs.openjdk.org/browse/JDK-8211104)).
> 
> Other than that it looks good to me.

I made a prototype for this suggestion: https://github.com/openjdk/jdk/commit/587660c8ee198447ea82ddd61dc9d41f2c559daa that seems to at least compile. No testing. One may also move  `rdcqs` and `preserved_marks_set` completely into the `G1ParScanThreadStateSet` as neither is used elsewhere.

-------------

PR: https://git.openjdk.org/jdk/pull/11053