RFR: 8272083: G1: Record iterated range for BOT performance during card scan [v5]

Tue Oct 26 14:08:10 UTC 2021

On Mon, 18 Oct 2021 09:32:40 GMT, Yude Lin <duke at openjdk.java.net> wrote:

>> A fix to the problem in 8272083 is to use a per-worker pointer to indicate where the worker has scanned up to, similar to the _scanned_to variable. The difference is that this pointer (I call it _iterated_to) records the end of the object, whereas _scanned_to records the end of the scan. Since we always scan with increasing addresses, the end of the latest object scanned is also the address up to which the BOT has been fixed. So we avoid having to fix below this address when calling block_start(). This implementation reduces the number of calls to set_offset_array() during scan_heap_roots() by a factor of roughly 2-10 (in my casual test with -XX:G1ConcRefinementGreenZone=1000000).
>> 
>> What this approach does not solve is random access to the BOT. So far I haven't found anything that has this access pattern.
>
> Yude Lin has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - Removed additional thread and card set code
>  - Switch to dcq and refinement threads to manage the plab cards
>  - Trivial
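
For readers skimming the thread, the `_iterated_to` idea quoted above can be pictured with a toy model. This is only a sketch under loud assumptions, not HotSpot code: `ToyRegion`, `ToyScanner`, `walk_to` and `make_uniform_region` are hypothetical names, addresses are plain word offsets, and a lookup without a hint is assumed to walk from the bottom of the region every time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Toy model (assumption, not the G1 data structures): objects are laid out
// back to back and described by a start-offset -> size map.
struct ToyRegion {
    std::map<std::size_t, std::size_t> size_at; // object start -> size in words
    std::size_t steps = 0;                      // objects stepped over, a stand-in
                                                // for BOT fix-up work

    // Walk forward from the object start `from` to the object covering `addr`
    // and return the end of that object. `addr` must lie inside the region,
    // otherwise std::map::at throws.
    std::size_t walk_to(std::size_t from, std::size_t addr) {
        std::size_t cur = from;
        while (cur + size_at.at(cur) <= addr) {
            cur += size_at.at(cur);
            ++steps;
        }
        return cur + size_at.at(cur);
    }
};

// Per-worker state: remembers the end of the latest object it walked to.
struct ToyScanner {
    std::size_t iterated_to = 0;

    // Cards are assumed to arrive in increasing address order.
    void scan_card(ToyRegion& region, std::size_t addr) {
        if (addr < iterated_to) {
            return; // already covered by the object walked to last time
        }
        iterated_to = region.walk_to(iterated_to, addr);
    }
};

// Helper for the example: `objects` objects of `words` words each.
ToyRegion make_uniform_region(std::size_t objects, std::size_t words) {
    ToyRegion region;
    for (std::size_t i = 0; i < objects; ++i) {
        region.size_at[i * words] = words;
    }
    return region;
}
```

In this model, looking up offsets 5, 13 and 29 in a region of eight 4-word objects steps over 11 objects when every lookup starts from the bottom, and over 5 when the scanner resumes from its remembered position. The numbers only illustrate the shape of the saving; they say nothing about the real ratio in G1.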

I've looked through the patch but won't focus on reviewing it right now. Instead, I've spent time testing it and comparing it to my approach of doing the work inside the pause.

Some observations:
- Both of our approaches touch the hot path and add code to `do_copy_to_survivor_space(...)`, but I don't see any clear regression in object copy times for either approach, which is very good.
- I also see a clear reduction in scan times for both approaches, but the reduction is smaller when the work is done concurrently. This could be because not everything gets updated between the pauses.
- The "Total refinement" time (`-Xlog:gc+refine+stats`) also goes down with both approaches, quite significantly, but again the decrease is bigger when doing the work in the pause. This is not so surprising since no additional work is added to the refinement threads for my approach.

I have also done a very hacky PoC that updates all new old regions concurrently using the G1 service thread. This approach doesn't need to touch the object copy path but instead just records that any new old region needs to be "fixed". This approach looks very good from a pause time perspective, but just using one thread doesn't scale very well. 
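
The general shape of that PoC, recording regions during the pause and fixing them later from a single thread, could look roughly like the sketch below. It is not the actual PoC code: `PendingBotFixups`, `record`, `try_pop` and `drain` are hypothetical names, and region indices stand in for real region objects.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical queue of regions whose BOT still needs fixing.
class PendingBotFixups {
    std::mutex lock_;
    std::queue<unsigned> regions_;

public:
    // Called when a new old region is allocated during the pause; the
    // object copy path itself is not touched.
    void record(unsigned region_index) {
        std::lock_guard<std::mutex> guard(lock_);
        regions_.push(region_index);
    }

    // Called by the concurrent worker; returns false when nothing is pending.
    bool try_pop(unsigned& region_index) {
        std::lock_guard<std::mutex> guard(lock_);
        if (regions_.empty()) {
            return false;
        }
        region_index = regions_.front();
        regions_.pop();
        return true;
    }
};

// Drain the queue from one thread, applying `fix_region` to each entry.
// Returns the number of regions processed.
template <typename FixFn>
std::size_t drain(PendingBotFixups& pending, FixFn fix_region) {
    std::size_t processed = 0;
    unsigned region_index = 0;
    while (pending.try_pop(region_index)) {
        fix_region(region_index);
        ++processed;
    }
    return processed;
}
```

With a single caller of `drain`, throughput is bounded by how fast one thread can walk a region, which matches the scaling limit described above; several threads could call `drain` on the same queue, but deciding how many is exactly the heuristic question.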

One problem I see with using the refinement threads (to allow better scaling) is that it will probably make the heuristic for scaling the number of threads a bit more complicated, because there would be two types of work to handle. Have you given that any thought?

One way forward would be to first go with a solution doing the work inside the pause and then continue investigating how to move it to concurrent threads in an efficient and maintainable way.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5039