RFR: 8272083: G1: Record iterated range for BOT performance during card scan [v3]
Yude Lin
github.com+16811675+linade at openjdk.java.net
Mon Sep 13 07:45:53 UTC 2021
On Mon, 13 Sep 2021 04:44:51 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:
>> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the object, whereas _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed. So we avoid having to fix below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by roughly 2-10x (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
>>
>> What this approach does not solve is random access to the BOT. So far I haven't found anything with this access pattern.
>
> Yude Lin has updated the pull request incrementally with one additional commit since the last revision:
>
> Resolve TODOs
Hi Thomas,
I've updated the PR with new code. I hope it's at least a good basis for further discussion.
An outline of the current implementation:
1. During gc:
> PLABs are created during evacuation to contain promoted objects from survivor regions and surviving objects from old regions (mixed gc only). In each old region, the area between the region's top() before the evacuation pause and its top() after the pause is where the new PLABs are, and this is what we will record.
The above hasn't changed. We record the old top() and every PLAB that has crossed a card boundary. Recording is now done using a card set, where each recorded card represents one PLAB.
2. After the pause, a concurrent phase is scheduled to fix the BOT in these areas. A fixing worker claims a card from the card set; the card tells us which PLAB needs to be fixed, and the worker then fixes the BOT entries covered by that PLAB (see the sketch after this list).
3. For concurrent refinement:
> If concurrent refinement tries to refine a card, it will probably run into an unfixed part of the BOT. We prevent this by requiring the refiner to fix the part that the card points into.
The card could point into a PLAB that still needs fixing. We ask the card set whether this is the case; if it is, the card set returns the card of that PLAB, and the concurrent refinement thread (or Java thread) must fix it before refining.
4. If another gc pause is scheduled, we abort the unfinished fixing jobs and clear the card sets.
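To make the flow in the list above a bit more concrete, here is a minimal standalone C++ sketch of how a fixing worker and a refinement thread could interact with such a card set. All names here (PlabCardSet, record_plab, fix_bot_for_plab, ...) are illustrative stand-ins made up for this mail, not the classes in the patch or in HotSpot:

#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-region card set described above:
// one flag per card in the region, set when a PLAB represented by that
// card still needs its BOT entries fixed.
class PlabCardSet {
  std::vector<std::atomic<bool>> _pending;
public:
  explicit PlabCardSet(size_t cards_per_region) : _pending(cards_per_region) {}

  // Called when a PLAB crossing a card boundary is recorded during the pause.
  void record_plab(size_t card) {
    _pending[card].store(true, std::memory_order_release);
  }

  // Claim a pending PLAB card; returns true if the caller must fix it.
  bool claim(size_t card) {
    bool expected = true;
    return _pending[card].compare_exchange_strong(expected, false);
  }

  // Refinement path: does this card lie in a still-unfixed PLAB?
  // (The real set would map an arbitrary card back to its PLAB's card.)
  bool needs_fixing(size_t card) const {
    return _pending[card].load(std::memory_order_acquire);
  }
};

// Placeholder for walking the PLAB and updating the BOT entries it covers.
static void fix_bot_for_plab(size_t plab_card) { (void)plab_card; }

// Concurrent fixing worker: claim pending cards and fix the PLABs they represent.
void fixing_worker(PlabCardSet& set, size_t cards_per_region) {
  for (size_t card = 0; card < cards_per_region; ++card) {
    if (set.claim(card)) {
      fix_bot_for_plab(card);
    }
  }
}

// Refinement thread (or Java thread): fix the covering PLAB, if any,
// before doing the normal refinement of `card`.
void refine_card(PlabCardSet& set, size_t card) {
  if (set.needs_fixing(card) && set.claim(card)) {
    fix_bot_for_plab(card);
  }
  // ... normal remembered-set refinement of `card` goes here ...
}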
The card set implementation:
This card set needs to be populated in parallel and have entries removed concurrently. G1CardSet doesn't allow concurrent removal without major modification (from what I understand). Also, the cards we are managing have the following characteristics:
* They are in the back part of a region;
* They are at least a PLAB size apart from each other.
To take advantage of these, I used a separate card set. It can use an array of size region_size/plab_size, or a bitmap of cards. It only uses the bitmap when the PLAB size is small enough that the bitmap is smaller than the array (see the sketch below). I tried to keep the card set implementation decoupled from the fixer, so that if you think it's insufficient, we can improve or replace it.
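As a rough illustration of the sizing logic (the parameter names and byte accounting are only assumptions; the patch may count differently):

#include <cstddef>

enum class PlabSetRepr { Array, Bitmap };

// Pick the smaller footprint: one slot per possible PLAB in the region
// versus one bit per card in the region.
PlabSetRepr choose_representation(size_t region_size_words,
                                  size_t plab_size_words,
                                  size_t card_size_words,
                                  size_t slot_bytes /* e.g. sizeof(uint32_t) */) {
  size_t max_plabs    = region_size_words / plab_size_words;
  size_t num_cards    = region_size_words / card_size_words;
  size_t array_bytes  = max_plabs * slot_bytes;
  size_t bitmap_bytes = (num_cards + 7) / 8;
  return bitmap_bytes < array_bytes ? PlabSetRepr::Bitmap : PlabSetRepr::Array;
}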
Other considerations:
You are right that fixing a single PLAB could stall Java threads' concurrent refinement for too long. It occasionally takes over 10ms (most of the time below 1ms). One idea is to abort when fixing takes too long, or simply not let the refinement threads fix large PLABs when we predict it will take very long (a rough sketch follows). But large PLABs are the most costly ones and the main reason we want fixing at all. Maybe there is a way to postpone concurrent refinement a little after a gc, e.g., temporarily increase the buffer size, so that as many PLABs as possible are processed by the fixer threads? I'm not sure.
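For what it's worth, the "abort when fixing takes too long" idea could look roughly like this; the time budget and all names here are hypothetical, and the unfinished remainder would still have to be handed back to the fixer threads somehow:

#include <chrono>
#include <cstddef>

static void fix_one_card(size_t card) { (void)card; }  // placeholder

// Fix the BOT for the PLAB covering [first_card, end_card), but stop once
// the budget is exhausted; returns true if the whole PLAB was fixed.
bool fix_plab_with_budget(size_t first_card, size_t end_card,
                          std::chrono::microseconds budget) {
  auto deadline = std::chrono::steady_clock::now() + budget;
  for (size_t card = first_card; card < end_card; ++card) {
    fix_one_card(card);
    if (std::chrono::steady_clock::now() >= deadline) {
      return card + 1 == end_card;
    }
  }
  return true;
}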
Thank you!
Regards,
Yude
-------------
PR: https://git.openjdk.java.net/jdk/pull/5039