RFR: 8272083: G1: Record iterated range for BOT performance during card scan
Thomas Schatzl
tschatzl at openjdk.java.net
Mon Aug 23 10:33:30 UTC 2021
On Sat, 7 Aug 2021 04:56:08 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:
> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the object, while _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed. So we avoid having to fix below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by roughly 2-10x (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
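To make the mechanism concrete, here is a minimal standalone sketch of that idea (simplified, hypothetical names; not the PR's actual code): because a worker scans cards in increasing address order, block_start() can resume walking from the end of the last object it visited instead of from the coarse BOT answer, skipping any re-fixing below that point.

```cpp
#include <cstddef>
#include <cstdint>

struct Obj { size_t size; };   // stand-in for a heap object

struct ScanWorker {
  uint8_t* _iterated_to = nullptr;   // end of the last object walked

  // Walk forward from 'start' to the object covering 'addr'; the real code
  // would update BOT entries (set_offset_array()) for every card crossed.
  static uint8_t* walk(uint8_t* start, uint8_t* addr) {
    uint8_t* cur = start;
    while (cur + reinterpret_cast<Obj*>(cur)->size <= addr) {
      cur += reinterpret_cast<Obj*>(cur)->size;  // ... fix BOT entries here ...
    }
    return cur;
  }

  // 'coarse_start' is what the unfixed BOT returns for 'addr'.
  uint8_t* block_start(uint8_t* addr, uint8_t* coarse_start) {
    uint8_t* start = coarse_start;
    if (_iterated_to != nullptr && _iterated_to > start && _iterated_to <= addr) {
      start = _iterated_to;   // everything below is already walked and fixed
    }
    uint8_t* head = walk(start, addr);
    _iterated_to = head + reinterpret_cast<Obj*>(head)->size;
    return head;
  }
};
```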
>
> What this approach does not solve is random access to the BOT. So far I haven't found anything with this access pattern.
First, thanks for all the work put into the graphs; they make discussion much easier, and so far provide a very good argument for pursuing this further. Some comments purely based on what is written here:
> (For completeness, I'll restate the problem in 8272083.)
>
> The block offset table (BOT) is something that serves the purpose
> "given a heap word, return the head of the object that covers this heap word"
> We typically use the BOT to find the objects that cover the dirty memory marked by a dirty card.
>
> We need to update the BOT with the information it needs (obj addr, obj size) at allocation. But during gc, we update it with (plab addr, plab size). When allocating within a plab, we don't update the BOT. So after a gc, the BOT becomes crude. The BOT will refine itself lazily when it's being used, that is, it will walk the plabs and update itself according to the objects it finds during the walk. The current algorithm is such that the BOT will sometimes walk the same part of a plab more than once, which is unnecessary. The update operation performs an atomic store for each 512 bytes of memory it walks.
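As a reference for the discussion below, a toy model of that lookup/fixing scheme (heavily simplified and not HotSpot code; the real G1 BOT encodes large back-offsets logarithmically, but the one-atomic-store-per-512-bytes behavior is the relevant part):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

static const size_t CardSize = 512;   // bytes covered per BOT entry

struct Bot {
  uint8_t* _bottom;                     // start of the covered area
  std::atomic<uint16_t>* _offsets;      // per-card: cards back to block start

  size_t card_of(const uint8_t* addr) const {
    return (size_t)(addr - _bottom) / CardSize;
  }

  // Record, for every card an object [start, start + size) crosses onto,
  // how far back its head lies. This models the set_offset_array() stores:
  // one atomic store per 512 bytes walked.
  void fix(uint8_t* start, size_t size) {
    size_t first = card_of(start);
    size_t last  = card_of(start + size - 1);
    for (size_t c = first + 1; c <= last; c++) {
      _offsets[c].store((uint16_t)(c - first), std::memory_order_relaxed);
    }
  }

  // Returns the card boundary at/below the covering block's start. A crude
  // (post-GC) entry points back to a PLAB start instead of the object, so
  // the caller must walk forward object by object, fix()-ing as it goes.
  uint8_t* block_start_card(uint8_t* addr) {
    size_t c = card_of(addr);
    return _bottom + (c - _offsets[c].load(std::memory_order_relaxed)) * CardSize;
  }
};
```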
>
> This affects both concurrent refinement and scanning heap roots during the pause.
>
> Using a dedicated "concurrent BOT fixing phase" we can remove most of the duplicated updates, and also move most of the actually-necessary updates out of gc pause. The result is something like
> 
> This is a zoom-in of a box plot which shows that the fixing phase (indicated by "fix" in the label) can reduce the total scan heap roots time. It's more effective when the refinement strength is low (indicated by refine=x, where x is the non-adaptive green threshold and "auto" means using the default adaptive mode).
>
> It can also increase the concurrent refinement speed, because it reduces the work there as well. The cost is a dedicated concurrent phase, which is described as follows.
>
> 1. PLABs are created during evacuation to contain promoted objects from survivor regions and surviving objects from old regions (mixed gc only). In each old region, the area between the old top() and the new top(), i.e. top() before and after an evacuation pause, is where the new PLABs are; we record these areas.
Note that there are also direct allocations for "larger" objects.
>
> 2. After the pause, a concurrent phase will be scheduled to fix the BOT in these areas. Basically a fixing worker will claim a part of the area (we'll call it a chunk), and walk the memory for objects. If an object crosses a BOT boundary, it will update the BOT entry. Some tweaks need to be made to make sure different workers never fix the same BOT entry.
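A minimal sketch of how such chunk claiming could look (my assumptions, not the PR's code; `first_obj_in` stands for whatever mechanism lets a worker find the first object starting in its chunk, which is what keeps two workers from fixing the same entry):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Obj { size_t size; };
static size_t obj_size(uint8_t* p) { return reinterpret_cast<Obj*>(p)->size; }

static const size_t ChunkBytes = 64 * 1024;   // assumed chunk granularity

// One recorded [old top(), new top()) area of an old region.
struct FixArea {
  uint8_t* _start;
  uint8_t* _end;
  std::atomic<size_t> _claim{0};              // next chunk index to hand out

  bool claim_chunk(uint8_t** from, uint8_t** to) {
    size_t idx = _claim.fetch_add(1, std::memory_order_relaxed);
    uint8_t* f = _start + idx * ChunkBytes;
    if (f >= _end) return false;
    *from = f;
    *to   = (f + ChunkBytes < _end) ? f + ChunkBytes : _end;
    return true;
  }
};

// Worker loop: walk the objects of each claimed chunk and update the BOT
// for objects crossing a card boundary (update_bot elided to a callback).
void fix_worker(FixArea* area,
                uint8_t* (*first_obj_in)(uint8_t* chunk_start),
                void (*update_bot)(uint8_t* obj, size_t size)) {
  uint8_t *from, *to;
  while (area->claim_chunk(&from, &to)) {
    for (uint8_t* cur = first_obj_in(from); cur < to; cur += obj_size(cur)) {
      update_bot(cur, obj_size(cur));
    }
  }
}
```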
>
> 3. If concurrent refinement tries to refine a card, it will probably run into an unfixed part of the BOT. We prevent this by requiring the refiner to fix the chunk the card points into. If another worker has already claimed that chunk, then refinement can proceed without a guarantee that the chunk is actually fixed. This does not happen often but might be a problem.
This paragraph indicates that the refinement thread never fixes up the BOT, and that there is some synchronization mechanism between the BOT-fixing threads and the refinement threads.
Not sure how this works, but it depends on how the chunks for BOT-fixup are claimed - maybe the refinement thread could fix the BOT for that chunk first (if unclaimed)? Just an idea: could the refinement threads do that BOT-fixup work first, "before" (or during) refining work?
E.g. if a refinement thread gets a card, first check whether there is a corresponding BOT-fixup chunk to do, and do that first, handling all to-be-refined cards in that PLAB area at the same time (or not at the same time - there are a few options here); a rough sketch follows below.
I do not think leaving some work for card scanning is an issue as long as most work is done.
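A rough sketch of that option (pure assumption on my side about how chunk claiming could interact with refinement; all names hypothetical):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical chunk descriptor with claimed/fixed flags.
struct Chunk {
  std::atomic<bool> _claimed{false};
  std::atomic<bool> _fixed{false};
  bool try_claim() { bool e = false; return _claimed.compare_exchange_strong(e, true); }
  bool is_fixed() const { return _fixed.load(std::memory_order_acquire); }
};

// Stubs standing in for the real work (not HotSpot API).
Chunk* chunk_covering(uint8_t* /*card_addr*/) { return nullptr; }
void fix_chunk(Chunk* c) { c->_fixed.store(true, std::memory_order_release); }
void do_refine(uint8_t* /*card_addr*/) {}

// Returns false if the card should be requeued because another thread is
// still fixing the chunk it points into.
bool refine_card(uint8_t* card_addr) {
  Chunk* ch = chunk_covering(card_addr);
  if (ch != nullptr) {
    if (ch->try_claim()) {
      fix_chunk(ch);            // do the BOT-fixup work first
    } else if (!ch->is_fixed()) {
      return false;             // claimed elsewhere but not done yet
    }
  }
  do_refine(card_addr);         // BOT below the card is now exact
  return true;
}
```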
>
> 4. If another gc pause is scheduled, we need to wait for the fixing phase to complete, because we assume that "the BOT is going to be fixed almost entirely anyway; might as well do it sooner". We could potentially abort though, and handle the unfixed part in some other way.
This paragraph indicates to me that the BOT-fix thread has higher priority than the GC pause, which seems to be the wrong way to look at the problem - in this case I would rather have the GC eat the cost of fixing up the BOT, as the BOT fixup seems supplementary.
Delaying the remaining BOT-fixing work until after the GC pause does not seem to be a good idea either; that seems to only result in work piling up forever (assuming that in steady state the GC always comes around faster than that work can complete).
Not completely sure if this is what you meant.
Or did you mean that in the pause some extra(?) phase completes the fixup work, which could be another option?
However, just leaving the remainder of the work for the card scan, even if it sometimes does duplicate work, does not seem to be an issue to me. This problem of duplicate work is probably largest on large heaps, where the time between pauses allows for complete concurrent BOT-fixup anyway.
>
>
> My thoughts for this solution:
>
> 1. Why not use a queue to record precisely where the plabs are?
> First, the queue costs space. Maybe 1/128 of heap space at worst (e.g., 4 bytes per plab pointer; a region can have many plabs, but only plabs larger than a card are interesting, so at most region_size/512 plabs in total). We can record only very large plabs to reduce the queue size. Still, it costs space.
Fwiw, there is a fairly large default minimum PLAB size (2k plus something) which bounds the amount too; you could also do some simple compression here that would decrease the size of these data points significantly: it is possibly sufficient to just store the "start card" of that PLAB anyway, i.e. you can actually get away with 16-bit entries, similar to the remembered sets.
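For illustration (my arithmetic, not existing code): with 512-byte cards and the current 32M maximum region size there are at most 32M / 512 = 65536 cards per region, so a card index within a region fits exactly into 16 bits:

```cpp
#include <cstddef>
#include <cstdint>

static const size_t CardSize = 512;

// Store a PLAB start as its card index relative to the region bottom.
uint16_t encode_plab_start(uint8_t* region_bottom, uint8_t* plab_start) {
  return (uint16_t)((size_t)(plab_start - region_bottom) / CardSize);
}

// Card-granular decode; exact enough, since fixing only needs to know in
// which card the PLAB starts.
uint8_t* decode_plab_start(uint8_t* region_bottom, uint16_t card) {
  return region_bottom + (size_t)card * CardSize;
}
```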
Actually, you could probably directly use a `G1CardSet` to store these, probably with some tuned settings, you might not need an extra data structure at all (depending on the needs on that data structure). The `G1CardSet` is a fairly compact representation.
Another option could be to just store a pointer to the first PLAB per region where this work needs to be done (similar to `G1CMRootMemRegions`); the additional PLAB boundaries could be recalculated using the (known) PLAB size. I.e. `base + x * PLABSize` gives you the (x+1)th PLAB start.
Direct allocations are a (small) problem, as they change that boundary. However, I think these can be detected, assuming the refinement threads do not interfere: given a `cur` pointer within that region which indicates where other threads need to continue fixing up the BOT (until `top()` of course), if the BOT entry at `card(cur + PLABSize)` does not exactly point to `base + PLABSize` (i.e. the start of the next PLAB), this must have been a single object (or the first object's size is larger than a PLAB), so just adjust `cur` by that single object's size. I *think* the BOT of these single objects is already correct, so they can be skipped.
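A compilable sketch of this recalculation plus the direct-allocation detection (the helpers are stubbed assumptions, not HotSpot API):

```cpp
#include <cstddef>
#include <cstdint>

struct Obj { size_t size; };

// Assumed queries, stubbed for the sketch.
uint8_t* bot_block_start(uint8_t* addr) { return addr; }    // exact lookup
void fix_bot_range(uint8_t* /*from*/, uint8_t* /*to*/) {}   // walk + store

void fix_region(uint8_t* base, uint8_t* top, size_t plab_size) {
  uint8_t* cur = base;
  while (cur < top) {
    uint8_t* next = cur + plab_size;           // expected next PLAB start
    if (next >= top) { fix_bot_range(cur, top); break; }
    if (bot_block_start(next) == next) {
      fix_bot_range(cur, next);                // a regular PLAB
      cur = next;
    } else {
      // A direct allocation (or first object larger than a PLAB): its BOT
      // entries are assumed already exact, so skip it by the object's size.
      cur += reinterpret_cast<Obj*>(cur)->size;
    }
  }
}
```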
Another argument against a too elaborate scheme here is that the number of old gen regions allocated should be fairly small in the normal case due to the generational hypothesis, so the memory consumption for this should not matter.
> Second, the enqueue action costs a little allocation time. Third, if we want to do dummy-object filling on top of fixing, as Thomas suggested earlier, it's going to need a heap walk anyway (I imagine).
G1 already does a heap walk after marking (the rebuild remembered set phase). Only then do we also need dummy-object filling (without a marking there is no class unloading, i.e. no objects become invalid).
The difference to the BOT-fixup work is that you want to do this BOT-fixup work after every GC (that allocates into old).
(This interrupting action also needs to be considered when/if implementing that concurrent reformatting).
The dummy-object formatting was only a note that the same work (BOT fixup) also needs to be done when implementing the other, so similar/the same issues will arise.
That is however something to be explored in a different CR/PR when/if done.
> One thing I didn't quite get is what you mean by
>
>
> > One idea could be that after allocating a PLAB (in old gen only of course) we enqueue some background task to fix these. We know the exact locations and sizes of those, maybe this somewhat helps. We also know the chunk sizes that are used for scanning, so we know the problematic places.
>
> Thomas? I don't know how knowing the chunk sizes is going to help narrow down the problematic places. I feel like I'm being stupid and missing something :D Would you help me understand? Do you think we can work from here?
The "problematic places" were the places where such an overlap of work during card scan may occur, still assuming that we continue to fix up the card table during card scan.
I.e. if the chunk size for card scan is e.g. 16k words, one option could be to only fix the BOT for PLABs crossing those 16k boundaries; since we know that during card scan threads claim cards in 16k chunks (idk right now what it actually is, probably higher), overlapping work can only happen across them. So if we had a good BOT entry at every 16k crossing, every thread during card scan would "never" do duplicate work. At least I believe so :)
Of course, if the card scanning did not need to do the fixup at all, that would be even better, as this, as your graphs indicate, decreases pause time.
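A sketch of that boundary-only variant (chunk size and BOT helper are assumed; each PLAB is walked once, and only the entries at scan-chunk crossings are made exact):

```cpp
#include <cstddef>
#include <cstdint>

static const size_t ScanChunkBytes = 16 * 1024;   // assumed claim granularity

struct Obj { size_t size; };

// Assumed helper: record that the object at 'block' covers 'boundary'.
void bot_set_entry(uint8_t* boundary, uint8_t* block) { (void)boundary; (void)block; }

// Walk from a known object start to 'boundary' and make the BOT exact
// there only; entries in between stay crude on purpose.
uint8_t* fix_at_boundary(uint8_t* start, uint8_t* boundary) {
  uint8_t* cur = start;
  while (cur + reinterpret_cast<Obj*>(cur)->size <= boundary) {
    cur += reinterpret_cast<Obj*>(cur)->size;
  }
  bot_set_entry(boundary, cur);
  return cur;   // becomes 'start' for the next boundary
}

void fix_plab_boundaries(uint8_t* plab_start, uint8_t* plab_end) {
  uintptr_t mask = (uintptr_t)ScanChunkBytes - 1;
  uint8_t* b = (uint8_t*)(((uintptr_t)plab_start + mask) & ~mask); // align up
  for (uint8_t* start = plab_start; b < plab_end; b += ScanChunkBytes) {
    start = fix_at_boundary(start, b);
  }
}
```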
>
> I'll upload the code when it's more polished (supposing the general direction is okay :D)
So far it sounds good and the results seem good as well.
>
> Some experiment details (using `numactl -C 0-15 -m 0 java -Xmx8g -Xms8g -XX:ParallelGCThreads=4 <-XX:G1ConcRefinementGreenZone=$refine> <-XX:-G1UseAdaptiveConcRefinement> -Xlog:gc*=debug:file=$fname -Dspecjbb.controller.type=PRESET -Dspecjbb.controller.preset.ir=8000 -Dspecjbb.controller.preset.duration=600000 -jar specjbb2015.jar -m COMPOSITE`):
The graphs look promising so far; I'm only a bit concerned about the total pause time outliers (system.gc()?) as they seem a bit high. Also for some reason the command line above only configures 4 gc threads out of 16.
I'm looking forward to these patches. :)
Thanks,
Thomas
-------------
PR: https://git.openjdk.java.net/jdk/pull/5039