RFR: 8272083: G1: Record iterated range for BOT performance during card scan [v3]

Fri Oct 1 11:55:31 UTC 2021

On Mon, 13 Sep 2021 04:44:51 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:

>> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the object, whereas _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed. So we avoid having to fix below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by roughly 2-10x (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
>> 
>> What this approach does not solve is random access to the BOT. So far I haven't found anything with this pattern.
>
> Yude Lin has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Resolve TODOs
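
The _iterated_to idea from the quoted description can be sketched roughly as follows. This is a toy model under stated assumptions, not the HotSpot code: `scan_cards`, `obj_ends` and `card_starts` are hypothetical names, objects are reduced to their end offsets, and the model deliberately ignores that the real BOT gets refined as a side effect of the walk. It only illustrates why remembering the end of the last iterated object lets a later lookup skip everything below that address when cards are visited in increasing address order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, hypothetical names throughout (not HotSpot code).
// obj_ends[i]  = end offset of object i; objects are laid out back to back from 0.
// card_starts  = offsets to look up, in ascending order.
// Returns how many objects were stepped over in total while locating the
// object that covers each card start. Each step stands in for one BOT fix-up.
std::size_t scan_cards(const std::vector<std::size_t>& obj_ends,
                       const std::vector<std::size_t>& card_starts,
                       bool use_iterated_to) {
    std::size_t steps = 0;
    std::size_t iterated_idx = 0;  // index of the first object not yet iterated
    std::size_t iterated_to = 0;   // end offset of the last iterated object
    for (std::size_t addr : card_starts) {
        if (use_iterated_to && addr < iterated_to) {
            // The covering object is the one iterated last; no walk is needed.
            continue;
        }
        // Without the pointer, every lookup walks from the start of the range.
        std::size_t i = use_iterated_to ? iterated_idx : 0;
        while (i < obj_ends.size() && obj_ends[i] <= addr) {
            ++i;
            ++steps;
        }
        if (i == obj_ends.size()) {
            break;  // addr lies past the last object
        }
        // Object i covers addr: remember where it ends for the next lookup.
        iterated_idx = i + 1;
        iterated_to = obj_ends[i];
    }
    return steps;
}
```

With four 4-word objects (`obj_ends = {4, 8, 12, 16}`) and lookups at offsets 0, 9 and 13, the walk steps over five objects without the pointer and over one object with it.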

Initial thoughts from a first pass through the change and some more testing. More thoughts may trickle in over time:

* G1 already uses a lot of threads, so adding more does not seem to be a good idea. Also, a single thread is going to be overwhelmed on large heaps, probably making the method less effective exactly where it is needed most. Maybe just fake cards to scan in the DCQS so that this work is done first (and always) by the refinement threads? Some tweaking of thread counts and of the refinement threads is likely needed.
* Not sure whether the complexity of using the bitmap level as storage is worth the effort: in my testing I have never even come close to 512 PLABs per region. In that case (or even earlier), probably just bail out, drop the whole task and do nothing, as with that many PLABs the amount of overlap during GC is likely to be small. I need to do some more testing and thinking about this though.
* The G1BOTFixingCardSet in HeapRegion should at most be a pointer within HeapRegion: since only a small percentage of regions are ever affected by this, it seems a waste to always allocate memory for them, even if it is only a little.
* Actually I have only seen a mid-single-digit number of PLABs per region in whatever I have been running, so I even kind of think it might be useful to decrease the maximum PLAB size to get more of them, so that more threads can work on these and each individual BOT fixup is faster (and can abort faster). I have no particular guidance at this time on how large is too large, but something like half or a third of a region for 32m regions is quite a bit to chew on :) This of course affects the storage needs, but the limit should always be such that we would never want to use the bitmap.
* Some potential renames, to be done only when we are done evaluating this: rename this feature to `G1ConcurrentBOTUpdate`, not "fixing" :)
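
To make the second and third bullets a bit more concrete, here is a minimal sketch of what I have in mind, assuming a small fixed cap on recorded PLABs per region. All names (`ToyPlabRecord`, `ToyHeapRegion`, `kMaxPlabs`) and the cap value are made up for illustration and are not taken from the patch or from G1. The region holds only a pointer that is allocated on the first recorded PLAB, and the record simply gives up once the cap is exceeded instead of falling back to a bitmap.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical per-region record of PLAB start offsets (not the patch's class).
class ToyPlabRecord {
public:
    static constexpr std::size_t kMaxPlabs = 8;  // assumed cap, not a G1 constant

    // Records a PLAB start. Returns false once the cap has been exceeded;
    // from then on the record is abandoned and nothing more is stored.
    bool add(std::size_t plab_start) {
        if (_overflowed) {
            return false;
        }
        if (_count == kMaxPlabs) {
            _overflowed = true;  // bail out: drop the update work for this region
            _count = 0;
            return false;
        }
        _starts[_count++] = plab_start;
        return true;
    }

    bool overflowed() const { return _overflowed; }
    std::size_t count() const { return _count; }

private:
    std::array<std::size_t, kMaxPlabs> _starts{};
    std::size_t _count = 0;
    bool _overflowed = false;
};

// Hypothetical region that pays only for one pointer unless a PLAB is recorded.
class ToyHeapRegion {
public:
    bool record_plab(std::size_t plab_start) {
        if (!_plabs) {
            _plabs = std::make_unique<ToyPlabRecord>();  // lazy allocation
        }
        return _plabs->add(plab_start);
    }

    bool has_record() const { return _plabs != nullptr; }
    const ToyPlabRecord* record() const { return _plabs.get(); }

private:
    std::unique_ptr<ToyPlabRecord> _plabs;  // null for unaffected regions
};
```

A region that never sees a PLAB keeps a null pointer, and a region that overflows the cap reports `overflowed()` so the caller can skip the concurrent update for it entirely.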

-------------

PR: https://git.openjdk.java.net/jdk/pull/5039