RFR: 8272083: G1: Record iterated range for BOT performance during card scan

Tue Aug 10 07:43:37 UTC 2021

On Sat, 7 Aug 2021 04:56:08 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:

> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is this pointer (I call it _iterated_to) records the end of the object and _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address where BOT has fixed up to. So we avoid having to fix below this address when calling block_start(). This implementation approximately reduce the number of calls to set_offset_array() during scan_heap_roots() 2-10 times (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
> 
> What this approach not solving is random access to BOT. So far I haven't found anything having this pattern.

>From what I understand from a brief look at the backwards scan prototype (I will continue a bit, and I may be wrong), the code actually implements scanning backwards the cards within claimed chunks. I wasn't really suggesting to scan backwards the cards, but claim the chunks (i.e. the set of cards a given thread will look at) from top to bottom addresses within the region. Or just have the first claim be the top chunk, or as a first task for a completely untouched region, have the thread explicitly update the BOT. Or after claiming a chunk, first fix the bot for that chunk, and then scan through.

The first thread claiming the top chunk will do all the work, and by the time the other threads (working on other regions) have finished, the BOT is done and ready for use.

Afaik the BOT update work needs to be done always anyway, so you could imagine that it might be useful to do it as an entirely extra step (as needed). Note that I'm not sure if these ideas are actually any good :P

I am also trying to understand a little how serious the problem is in terms of actual performance hit by the duplicate work. It might be fairly high when not doing refinement at all as you indicate.

The reason is basically that the suggested change bleeds fairly much into other non-card-scanning code so I want to at least think about alternatives that may be limited to e.g. just the remset scan code.

As for concurrent BOT update, I believe that the existing concurrent refinement code already concurrently updates the BOT as needed afaik so that shouldn't be an issue (I need to go look in detail). One idea could be that after allocating a PLAB (in old gen only of course) we enqueue some background task to fix these. We know the exact locations and sizes of those, maybe this somewhat helps. We also know the chunk sizes that are used for scanning, so we know the problematic places.
Something like the "root region scan" for marking (but we do not need to wait for it to complete during gc).

-------------

PR: https://git.openjdk.java.net/jdk/pull/5039