RFR: 8272083: G1: Record iterated range for BOT performance during card scan [v4]

Yude Lin duke at openjdk.java.net
Wed Oct 13 05:48:52 UTC 2021


On Sat, 9 Oct 2021 10:05:07 GMT, Yude Lin <duke at openjdk.java.net> wrote:

>> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the last object scanned, while _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed up. So we avoid having to fix the BOT below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by roughly 2-10x (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
>> 
>> What this approach does not solve is random access to the BOT. So far I haven't found any code with that access pattern.
>
> Yude Lin has updated the pull request incrementally with four additional commits since the last revision:
> 
>  - Fixed a bug in array claim
>  - Fixed a bug that causes premature deactivate
>  - Renaming variables
>  - Renaming
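
For context, the per-worker pointer described in the quoted text boils down to roughly the sketch below. The names are made up for illustration; this is not the actual patch code.

```cpp
#include <cstdint>

// Minimal sketch (hypothetical names): each worker keeps a cursor recording
// the end of the last object it walked. Because cards are scanned in
// increasing address order, the BOT is already exact below that address, so
// block_start() needs no set_offset_array()-style fix-up there.
class PerWorkerScanCursor {
  uintptr_t _iterated_to;  // end address of the last object walked
  uintptr_t _scanned_to;   // end address of the last card range scanned

public:
  PerWorkerScanCursor() : _iterated_to(0), _scanned_to(0) {}

  // True if a block_start(addr) query can be answered without fixing up any
  // BOT entries, because objects up to _iterated_to were already walked.
  bool bot_fixed_up_for(uintptr_t addr) const {
    return addr < _iterated_to;
  }

  // Record the end of an object just walked; addresses only ever increase.
  void record_object_end(uintptr_t obj_end) {
    if (obj_end > _iterated_to) {
      _iterated_to = obj_end;
    }
  }

  // Record the end of the card range just scanned (the existing _scanned_to).
  void record_scan_end(uintptr_t scan_end) {
    if (scan_end > _scanned_to) {
      _scanned_to = scan_end;
    }
  }
};
```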

Hi Stefan,

I looked at the PoC code. My understanding is that you're updating the BOT as objects that cross card boundaries are allocated in a PLAB. I haven't tried this particular approach, but my first reaction when I found this issue was also to process the PLABs in the pause. (I chose a lazier approach: during a GC pause, update the BOT for PLABs allocated in the previous GC pause. Lazy or not, I think there is little difference. The lazy approach needs an additional phase, plus some code to coordinate the parallel BOT update, which has overhead; whereas updating the BOT as objects are allocated into a PLAB might waste some work, because in a mixed GC there might be some old regions we never need to scan?)
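
If I've read the PoC right, the eager variant amounts to something like the sketch below. The card size constant and helper names are my own placeholders, not the PoC's code.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of an eager BOT update during PLAB allocation (assumed names).
// Only objects that cross a card boundary need a BOT entry; objects fully
// inside a card are found by walking from the card's first block.
static const size_t kCardSizeBytes = 512;  // G1 card size

class PlabWithBotUpdate {
  uintptr_t _top;  // current allocation pointer inside the PLAB
  uintptr_t _end;  // end of the PLAB

  // Placeholder for the real BOT update: record that the block starting at
  // 'obj_start' covers every card boundary up to 'obj_end'.
  void update_bot_for_block(uintptr_t obj_start, uintptr_t obj_end) {
    (void)obj_start; (void)obj_end;  // real code would update the BOT here
  }

public:
  PlabWithBotUpdate(uintptr_t bottom, uintptr_t end) : _top(bottom), _end(end) {}

  // Bump-pointer allocation; returns 0 on failure.
  uintptr_t allocate(size_t size_in_bytes) {
    if (_end - _top < size_in_bytes) return 0;
    uintptr_t obj_start = _top;
    _top += size_in_bytes;

    // Update the BOT only when the new object spans a card boundary.
    uintptr_t first_card = obj_start / kCardSizeBytes;
    uintptr_t last_card  = (_top - 1) / kCardSizeBytes;
    if (first_card != last_card) {
      update_bot_for_block(obj_start, _top);
    }
    return obj_start;
  }
};
```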

Anyway, I like the idea of removing all of the cost from pause time, which is what the current approach tries to achieve. I don't think there will be a lot more concurrent work than there currently is: if we don't update the BOT concurrently, the refinement threads still have to update a large part of the BOT, so in effect the work is transferred from concurrent refinement to concurrent BOT update. As you can see in an earlier graph, the concurrent refinement rates actually increased, but that compares concurrent BOT update against no BOT update at all. If we were to compare concurrent BOT update against pause-time BOT update, then yes, there would be additional concurrent work. But generally speaking, I think concurrent work should be favored over pause-time work.

By the way, I'm working on an update to the patch. It will reuse the concurrent refinement threads and the dirty card queue infrastructure, as suggested in an earlier discussion. The patch looks less scary without the additional threads and card set data structures, so I hope that will lessen your worry about this solution. Thanks!
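
To give a rough idea of the shape I have in mind for that update, see the sketch below. The queue class and function names are purely illustrative; the real patch will reuse the existing dirty card queue set and refinement threads rather than introduce new machinery.

```cpp
#include <cstdint>
#include <mutex>
#include <queue>
#include <utility>

// Rough shape of a concurrent BOT update (illustrative names only): PLAB
// allocations that cross card boundaries are enqueued as address ranges, and
// a background worker (a refinement thread in the real patch) drains the
// queue and fixes up the BOT outside the pause.
class ConcurrentBotUpdateQueue {
  std::mutex _lock;
  std::queue<std::pair<uintptr_t, uintptr_t>> _ranges;  // [start, end) ranges

  // Placeholder for the real BOT fix-up over one range.
  static void update_bot(uintptr_t start, uintptr_t end) {
    (void)start; (void)end;
  }

public:
  // Called from the allocation path when a PLAB object crosses a card.
  void enqueue(uintptr_t start, uintptr_t end) {
    std::lock_guard<std::mutex> guard(_lock);
    _ranges.emplace(start, end);
  }

  // Called by a concurrent worker; returns false when the queue is empty.
  bool drain_one() {
    std::pair<uintptr_t, uintptr_t> range;
    {
      std::lock_guard<std::mutex> guard(_lock);
      if (_ranges.empty()) return false;
      range = _ranges.front();
      _ranges.pop();
    }
    update_bot(range.first, range.second);
    return true;
  }
};
```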

Regards,
Yude

-------------

PR: https://git.openjdk.java.net/jdk/pull/5039


