RFR: 8272083: G1: Record iterated range for BOT performance during card scan
Thomas Schatzl
tschatzl at openjdk.java.net
Tue Aug 24 11:45:26 UTC 2021
On Sat, 7 Aug 2021 04:56:08 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:
> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the object, while _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed. So we avoid having to fix below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by roughly 2-10x (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
>
> What this approach does not solve is random access to the BOT. So far I haven't found anything with this access pattern.
Hi,
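(For reference, my reading of the per-worker `_iterated_to` idea as a minimal standalone sketch - hypothetical names and simplified types, not the actual patch:)

```c++
// Each scan worker keeps a watermark recording the end of the last object it
// walked; block_start() then only needs to fix BOT entries above it, because
// everything below the watermark is already exact.
#include <cstddef>

class ScanWorkerState {
  size_t _scanned_to  = 0;  // end of the last card range scanned
  size_t _iterated_to = 0;  // end of the last object walked; BOT exact below this
public:
  // Find the block start covering addr; only BOT entries in
  // [_iterated_to, addr) may still be coarse and need fixing.
  size_t block_start(size_t addr) {
    size_t fix_from = (_iterated_to < addr) ? _iterated_to : addr;
    // ... walk and fix BOT entries in [fix_from, addr) instead of walking
    //     all the way back to the PLAB start, then return the object start ...
    return fix_from;  // placeholder for the real lookup
  }
  // Called after scanning an object so later block_start() calls can skip it.
  void record_object_end(size_t obj_end) {
    if (obj_end > _iterated_to) { _iterated_to = obj_end; }
  }
};
```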
>
> > > If concurrent refinement tries to refine a card, it will probably run into an unfixed part of the BOT. We prevent this by requiring the refiner to fix where the card is pointing at. If another worker has claimed the chunk the card points into, then refinement can proceed without a guarantee that the chunk is actually fixed. This does not happen often but might be a problem.
> >
> >
> > This paragraph indicates that the refinement thread never fixes up the BOT, and there is some synchronization mechanism between the BOT-fix thread and the refinement threads.
> > Not sure how this works, but it depends on how the chunks for BOT-fixup are claimed - maybe the refinement thread could fix the BOT for that chunk first (if unclaimed)? Just an idea, could the refinement threads do that BOT-fixup work first "before" (or during) refining work?
> > E.g. if a refinement thread gets a card, first check if there is a corresponding BOT-fixup chunk to do, and do that first, doing all to be refined cards in that PLAB area at the same time (or don't do that at the same time - there are a few options here).
> > I do not think leaving some work for card scanning is an issue as long as most work is done.
>
> What I meant here is when the refinement thread refines a card, it will first try to claim all the unfixed chunks in the same region, and fix them, up to the point where it knows that "where the card points into has been fixed, I will not run into unfixed BOT if I refine it". So I think it's like what you suggested. Refinement threads and fixer threads are synchronized by the chunk index. If one claims a chunk, the other will look for the next.
>
> An issue is that the chunks of a region are claimed from bottom to top; there is no easy way to fix a random one in the middle. It means refinement can stall for some time, but it won't be too long. I imagine, if we are using a queue to manage the BOT-fixing tasks, there is still an order imposed. The refinement thread has to dequeue and fix everything until where the card points into is fixed. But even taking into account the time to do all the fixing, the refinement rate increases (in my experiment). So I think it's worth it?
Fixing the BOT concurrently seems worthwhile so far, no worries.
The current work assignment heuristics seem suboptimal though, and probably assume too much that there is only a single refinement thread (which isn't true even in your case - any Java application thread may also start refining at some point).
However the main drawback imho of the (or just one) refinement thread fixing everything (for a region - afaiu) seems to be that this operation might take too long. Our ballpark figure for the time allowance of a single task without checking back for a pending GC is up to `G1ConcMarkStepDurationMillis` (default 10ms, but that is rather on the high side nowadays; for another example, the remembered set rebuild works in 256kB chunks, which on machines current at that time typically took < 1ms iirc), and processing all chunks for a region seems to take longer than that given the outliers the graphs showed.
It's unfortunate that PLABs can get really large, so even just recording PLAB starts may be too coarse of a BOT-fixup granularity. However, before thinking about improving parallelism there, let's solve the initial problem :)
The length/size of the PLABs is also the reason why you want that operation parallelized even in the concurrent phase.
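Something along these lines is roughly what I have in mind for the work assignment (a standalone sketch with hypothetical names, simplified; not actual HotSpot code). A refinement thread that needs a particular chunk fixed could drive the same step until the cursor has passed the card it wants to refine:

```c++
// Chunks of a region are claimed bottom-to-top via an atomic cursor shared by
// fixer and refinement threads; each step checks back for a pending yield
// after roughly G1ConcMarkStepDurationMillis worth of work.
#include <atomic>
#include <chrono>
#include <cstddef>

struct RegionBOTFixCursor {
  std::atomic<size_t> _next_chunk{0};  // next unclaimed chunk index
  size_t _num_chunks;                  // chunks covering [bottom, top)
};

// Process chunks until none are left or the time allowance is used up.
// `fix_chunk` stands in for fixing the BOT of one small chunk (e.g. ~256kB).
template <typename FixChunkFn>
void fix_bot_step(RegionBOTFixCursor& cursor, FixChunkFn fix_chunk,
                  double step_duration_ms) {
  auto start = std::chrono::steady_clock::now();
  while (true) {
    size_t idx = cursor._next_chunk.fetch_add(1);
    if (idx >= cursor._num_chunks) {
      return;  // all chunks claimed (some may still be in flight on other threads)
    }
    fix_chunk(idx);
    double elapsed_ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
    if (elapsed_ms > step_duration_ms) {
      return;  // let the caller check for a pending GC/safepoint and yield
    }
  }
}
```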
>
> > > If another gc pause is scheduled, we need to wait for the fixing phase to complete. Because we assume that "BOT is going to be fixed almost entirely anyway. Might as well do it sooner". We can potentially abort though, and handle the unfixed part in some other way.
> >
> >
> > This paragraph indicates to me that the BOT-fix thread has higher priority than the GC pause, which seems to be the wrong way to look at the problem - in this case I would rather have the GC eat the cost of fixing up the BOT as the BOT fixup seems supplementary.
> > Delaying remaining BOT-fixing work for after the GC pause seems not to be a good idea, this seems to only result in forever piling up work (assuming in steady state the GC is always faster than completing that work).
> > Not completely sure if this is what you meant.
> > Or did you mean that in the pause some extra(?) phase completes the fixup work, which could be another option?
> > However just leaving the remainder of the work for the card scan, even if it sometimes does duplicate work does not seem to be an issue to me. This problem of duplicate work is probably largest on large heaps, where the time between pauses allows for complete concurrent BOT-fixup anyway.
>
> What I did here is to let the gc thread wait for the fixing _in the safepoint_, when everything is paused. I think it's equivalent to adding another phase in the gc pause, but I could be wrong. The data shows most of the time the waiting finishes within
During GC you should/could use all worker threads allowed for such things, which is faster - and maybe needs additional information to parallelize well.
> 0.1ms, but occasionally >10ms or even >100ms, which causes part of the problem you mentioned about the pause time outliers. I have yet to understand why. I'm sure something can be engineered to avoid this.
>
> But I agree we should probably abort the fixing when gc happens, if we consider the role of fixing as always supplementary to gc. I also realized afterwards that this waiting time should be considered by the pause time policy if we don't abort.
>
> > Actually, you could probably directly use a G1CardSet to store these, probably with some tuned settings, you might not need an extra data structure at all (depending on the needs on that data structure). The G1CardSet is a fairly compact representation.
> > Another option could also be to just store a pointer to the first PLAB per region where this work needs to be done (similar to G1CMRootMemRegions); the additional PLAB boundaries could be recalculated by using the (known) PLAB size. I.e. base + x * PLABSize gives you the (x+1)th PLAB start.
>
> I'll check out these data structures. Thanks for the pointer~
>
> > Direct allocations are a (small) problem, as they change that boundary, however I think these can be detected assuming the refinement threads do not interfere; given a cur pointer within that region, which indicates where other threads need to continue fixing up the BOT (until top() of course), if the BOT entry at card(cur + PLABSize) does not exactly point to base + PLABSize (i.e. the start of the next PLAB), this must have been a single object (or the first object's size is larger than a PLAB), so just adjust cur by that single object's size. I think the BOT of these single objects is already correct, so it can be skipped.
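(Roughly the boundary walk I had in mind, as a standalone sketch - the BOT lookup, object size and fix-up are passed in as hypothetical helpers; the real code would use the region's BOT directly:)

```c++
#include <cstddef>

// Walk candidate PLAB boundaries from the first recorded PLAB start up to top,
// fixing the BOT of each PLAB-sized block and stepping over blocks that turn
// out to be a single (larger-than-PLAB or direct) allocation.
template <typename BotStartFn, typename ObjSizeFn, typename FixFn>
void walk_plab_boundaries(size_t first_plab_start, size_t top, size_t plab_size,
                          BotStartFn bot_entry_start, ObjSizeFn object_size,
                          FixFn fix_plab_bot) {
  size_t cur = first_plab_start;
  while (cur < top) {
    size_t next = cur + plab_size;
    if (next >= top || bot_entry_start(next) == next) {
      // A new block starts exactly at `next`, so [cur, next) is one PLAB:
      // fix up its interior BOT entries.
      fix_plab_bot(cur, next < top ? plab_size : top - cur);
      cur = next;
    } else {
      // The BOT entry at the expected boundary does not point to the next
      // PLAB start: this must have been a single object whose BOT is already
      // exact, so just step over it.
      cur += object_size(cur);
    }
  }
}
```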
>
> But I still want to argue against using a queue or other extra data structure. An additional point (on top of space consumption) is that I don't have to make modifications to the allocation code if I just use the old_top()/new_top() scheme. I checked how to modify them. It involves changing too many interfaces in G1AllocRegion/HeapRegion/G1Allocator in order to pass down the information "am I allocating a plab or a large object?" to where the locked allocation actually happens. (Maybe we can discuss this over code when it's uploaded but) we do want to do the enqueue while the HeapRegion is locked, right?
The code in `G1ParScanThreadState::allocate_copy_slow()` is responsible for allocating either a (new) PLAB or directly. At most the called `G1PLABAllocator::allocate_direct_or_new_plab` would need to be changed to return the actual allocation size in addition to the base pointer, so that `allocate_copy_slow` can record it.
Fwiw, the call to `report_promotion_event()` in that method is actually notifying JFR of a new PLAB or direct allocation (note that it passes the `word_sz` of the object that caused that allocation, not the size of the block you got).
So it should be fairly easy to add a hook there.
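A rough sketch of the shape of such a hook (simplified stand-ins, not the actual HotSpot signatures):

```c++
#include <cstddef>
#include <vector>

struct AllocResult {
  void*  base;        // start of the new PLAB or direct allocation
  size_t actual_size; // actual block size handed out (not just word_sz)
};

// Stand-in for the recording done for the concurrent BOT-fixing phase.
static std::vector<AllocResult> g_unfixed_blocks;

// Stand-in for G1PLABAllocator::allocate_direct_or_new_plab, extended to
// report the actual allocation size alongside the base pointer.
static AllocResult allocate_direct_or_new_plab_sketch(size_t word_sz) {
  return { /*base=*/nullptr, /*actual_size=*/word_sz };  // stub
}

// Stand-in for the hook in G1ParScanThreadState::allocate_copy_slow().
static void allocate_copy_slow_sketch(size_t word_sz) {
  AllocResult r = allocate_direct_or_new_plab_sketch(word_sz);
  if (r.base != nullptr) {
    g_unfixed_blocks.push_back(r);  // record the block for later BOT fix-up
  }
}
```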
Of course it is probably most fruitful to discuss further implementation details on the actual code.
>
> > I.e. if the chunk size for card scan is e.g. 16k words, one option could be to only fix the BOT for PLABs crossing those 16k boundaries; since we know that during card scan threads are claiming cards per 16k chunk (idk right now what it actually is, probably higher), there can only be overlapping work happening across them.
>
> I _think_, even if a plab doesn't cross a chunk boundary, if we call block_start() with addr1 and addr2 (plab_start < addr1 < addr2) in that order, there is possibly duplicated work. The first call will try to fix [plab_start, addr1], the second call will try to fix [start_of_object_that_covers(addr1), addr2]. There is an overlap between these two ranges. The reason is that the second call goes back too far: the BOT offset array element at addr2 is not updated when [plab_start, addr1] is fixed, so it will go back to start_of_object_that_covers(addr1) (because the BOT element at addr1 is updated from plab_start to the real object that covers addr1).
That is true - I suggested to have that thread fix the BOT of that (single) entire PLAB (which might still be too large to do in one go; particularly specjbb20xx maxes out PLAB sizes).
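To make the overlap concrete (made-up addresses, word-sized units):

```c++
// PLAB spans [1000, 2000); objects start at 1000, 1300, 1600, ...
// block_start(1500): fixes the BOT over [1000, 1500), returns 1300.
// block_start(1800): the BOT entries above 1500 were not touched by the first
//   call, so the backward walk lands on the entry at 1500, which now points to
//   1300; the walk then fixes [1300, 1800), re-doing the [1300, 1500) part
//   already covered by the first call.
```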
>
> > The graphs look promising so far; I'm only a bit concerned about the total pause time outliers (system.gc()?) as they seem a bit high. Also for some reason the command line above only configures 4 gc threads out of 16.
>
> The pause time issue can be partially explained by the waiting time for fixing, which is to be removed as discussed earlier. I tend to use fewer threads for testing. It only has one concurrent gc thread. The results seem less random ; )
>
> Thank you for the detailed and informative reply!
Thanks,
Thomas
-------------
PR: https://git.openjdk.java.net/jdk/pull/5039