RFR: 8272083: G1: Record iterated range for BOT performance during card scan
Yude Lin
github.com+16811675+linade at openjdk.java.net
Sat Aug 21 09:28:26 UTC 2021
On Sat, 7 Aug 2021 04:56:08 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:
> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the last object scanned, while _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed, so we avoid having to fix below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by roughly 2-10x (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
>
> What this approach does not solve is random access to the BOT. So far I haven't found anything with that access pattern.
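To make the quoted idea concrete, here is a minimal sketch of the per-worker pointer, assuming cards are claimed in increasing address order. The types and helpers (`HeapWord` as a plain typedef, `block_start`, `object_size_at`) are simplified stand-ins for this mail, not HotSpot's real API:

```c++
#include <cstddef>
#include <cstdint>

typedef uintptr_t HeapWord;            // stand-in for HotSpot's HeapWord

HeapWord* block_start(HeapWord* addr); // BOT lookup; fixes the BOT on the way
size_t object_size_at(HeapWord* obj);  // hypothetical: object size in words

struct ScanState {                     // per-worker scan state (sketch)
  HeapWord* _scanned_to  = nullptr;    // end of the latest scanned range
  HeapWord* _iterated_to = nullptr;    // end of the latest object iterated;
                                       // the BOT is already fixed below here

  void scan_card(HeapWord* card_start, HeapWord* card_end) {
    // Cards are claimed in increasing address order, so if the previous
    // object iteration already reached into this card we can start there
    // and skip the (BOT-fixing) block_start() call entirely.
    HeapWord* cur = (_iterated_to != nullptr && _iterated_to >= card_start)
                        ? _iterated_to
                        : block_start(card_start);
    while (cur < card_end) {
      // ... apply the scan closure to the object at cur ...
      cur += object_size_at(cur);
    }
    _iterated_to = cur;                // end of the last object scanned
    _scanned_to  = card_end;
  }
};
```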
(For completeness, I'll restate the problem in 8272083.)
The block offset table (BOT) serves the purpose
"given a heap word, return the head of the object that covers this heap word"
We typically use the BOT to find the objects that cover the dirty memory marked by a dirty card.
We need to update the BOT with the information it needs (obj addr, obj size) at allocation time. But during gc, we update it with (plab addr, plab size), and when allocating within a plab we don't update the BOT at all. So after a gc the BOT is crude. The BOT refines itself lazily as it is used, that is, it walks the plabs and updates itself according to the objects it finds during the walk. The current algorithm is such that the BOT will sometimes walk the same part of a plab more than once, which is unnecessary, and the update operation performs an atomic store for every 512 bytes of memory it walks.
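For illustration, the lazy fix-up triggered by a BOT lookup looks roughly like this. This is a hand-written sketch of the walk, not the actual G1BlockOffsetTablePart code; `crude_block_start` and `update_bot_for` stand in for the real BOT entry reads and the set_offset_array() stores:

```c++
#include <cstddef>
#include <cstdint>

typedef uintptr_t HeapWord;                  // stand-in for HotSpot's HeapWord

HeapWord* crude_block_start(HeapWord* addr); // what the unrefined BOT answers,
                                             // e.g. the plab start after a gc
size_t object_size_at(HeapWord* obj);        // hypothetical: size in words
void update_bot_for(HeapWord* q, HeapWord* n); // one atomic entry store per
                                               // 512-byte card [q, n) crosses

// Rough shape of block_start(): walk forward from the crude answer,
// refining the BOT as we go, until we reach the object covering addr.
HeapWord* block_start(HeapWord* addr) {
  HeapWord* q = crude_block_start(addr);
  HeapWord* n = q + object_size_at(q);
  while (n <= addr) {
    q = n;
    n = q + object_size_at(q);
    update_bot_for(q, n);
  }
  return q;  // head of the object that covers addr
}
```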
This affects both concurrent refinement and scanning heap roots during the pause.
Using a dedicated "concurrent BOT fixing phase" we can remove most of the duplicated updates, and also move most of the actually-necessary updates out of the gc pause. The result is something like the following:

[box plot image omitted in the archive]
This is a zoom-in of a box plot which shows that the fixing phase (indicated by "fix" in the label) can reduce the total scan heap roots time. It's more effective when the refinement strength is low (indicated by refine=x, where x is the non-adaptive green threshold and "auto" means the default adaptive mode).
It also increases the concurrent refinement speed, because it reduces the work there as well. The cost is a dedicated concurrent phase, which is described as follows.
1. PLABs are created during evacuation to hold objects promoted from survivor regions and surviving objects from old regions (mixed gc only). In each old region, the area between the old top() (before the evacuation pause) and the new top() (after the pause) is where the new PLABs are; we record these areas.
2. After the pause, a concurrent phase is scheduled to fix the BOT in these areas. Basically, a fixing worker claims a part of an area (we'll call it a chunk) and walks the memory looking for objects. If an object crosses a BOT boundary, the worker updates the BOT entry. Some tweaks are needed to make sure different workers never fix the same BOT entry (see the sketch after this list).
3. If concurrent refinement tries to refine a card, it will probably run into an unfixed part of the BOT. We prevent this by requiring the refiner to first fix the chunk the card points into. If another worker has already claimed that chunk, refinement proceeds without a guarantee that the chunk has actually been fixed. This doesn't happen often, but it might be a problem.
4. If another gc pause is scheduled, we need to wait for the fixing phase to complete, because we assume that "the BOT is going to be fixed almost entirely anyway, so we might as well do it sooner". We could potentially abort instead, and handle the unfixed part in some other way.
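Here's a sketch of what a fixing worker in step 2 could look like. This is my own illustration of the chunk claiming for this mail, not the upcoming patch; `ChunkWords` and `fix_bot_in_chunk` are made up:

```c++
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>

typedef uintptr_t HeapWord;               // stand-in for HotSpot's HeapWord

void fix_bot_in_chunk(HeapWord* start, HeapWord* end); // walks the objects in
                                                       // [start, end) and fixes
                                                       // the BOT entries that
                                                       // this chunk owns

struct FixingTask {                       // one recorded area of new PLABs
  HeapWord* _bottom;                      // old top(), before the pause
  HeapWord* _top;                         // new top(), after the pause
  std::atomic<size_t> _next_chunk{0};
  static const size_t ChunkWords = 64 * 1024; // made-up chunk size

  // Run by each fixing worker until the whole area is claimed.
  void work() {
    size_t nchunks = (_top - _bottom + ChunkWords - 1) / ChunkWords;
    for (;;) {
      size_t i = _next_chunk.fetch_add(1);  // claim the next chunk
      if (i >= nchunks) return;
      HeapWord* start = _bottom + i * ChunkWords;
      HeapWord* end   = std::min(start + ChunkWords, _top);
      // Each worker only writes BOT entries inside its own chunk, so two
      // workers never fix the same entry.
      fix_bot_in_chunk(start, end);
    }
  }
};
```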
My thoughts for this solution:
1. Why not use a queue to record precisely where the plabs are?
First, the queue costs space, maybe 1/128 of the heap at worst (e.g., 4 bytes per plab pointer; a region can have many plabs, but only plabs larger than a card are interesting, so at most region_size/512 plabs per region). We could record only very large plabs to reduce the queue size, but it still costs space. Secondly, the enqueue operation costs a little allocation time. Thirdly, if we want to do dummy object filling on top of fixing, as Thomas suggested earlier, it's going to need a heap walk anyway (I imagine).
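Spelling out the worst-case arithmetic behind that 1/128 estimate (my numbers: a 512-byte card and a 4-byte pointer per recorded plab):

    plabs worth recording per region <= region_size / 512        (only plabs >= one card matter)
    queue bytes per region           <= 4 * region_size / 512
                                      = region_size / 128        i.e. ~1/128 of the heap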
One thing I didn't quite get is what you mean by
> One idea could be that after allocating a PLAB (in old gen only of course) we enqueue some background task to fix these. We know the exact locations and sizes of those, maybe this somewhat helps. We also know the chunk sizes that are used for scanning, so we know the problematic places.
Thomas? I don't see how knowing the chunk sizes helps narrow down the problematic places. I feel like I'm missing something :D Would you help me understand? Do you think we can work from here?
I'll upload the code when it's more polished (supposing the general direction is okay :D)
Some experiment details (using `numactl -C 0-15 -m 0 java -Xmx8g -Xms8g -XX:ParallelGCThreads=4 <-XX:G1ConcRefinementGreenZone=$refine> <-XX:-G1UseAdaptiveConcRefinement> -Xlog:gc*=debug:file=$fname -Dspecjbb.controller.type=PRESET -Dspecjbb.controller.preset.ir=8000 -Dspecjbb.controller.preset.duration=600000 -jar specjbb2015.jar -m COMPOSITE`):

[experiment results omitted in the archive]
-------------
PR: https://git.openjdk.java.net/jdk/pull/5039