RFR: 8272083: G1: Record iterated range for BOT performance during card scan

Thomas Schatzl tschatzl at openjdk.java.net
Wed Aug 11 09:01:26 UTC 2021


On Tue, 10 Aug 2021 11:24:50 GMT, Yude Lin <github.com+16811675+linade at openjdk.org> wrote:

> > From what I understand from a brief look at the backwards scan prototype (I will continue a bit, and I may be wrong), the code actually implements scanning backwards the cards within claimed chunks. I wasn't really suggesting to scan backwards the cards, but claim the chunks (i.e. the set of cards a given thread will look at) from top to bottom addresses within the region. Or just have the first claim be the top chunk, or as a first task for a completely untouched region, have the thread explicitly update the BOT. Or after claiming a chunk, first fix the bot for that chunk, and then scan through.
> 
> The prototype reverses both the chunk claiming order and the card scanning order within each chunk, because I thought that, within each chunk, a forward scan could also do repeated work.

I would not think so. By the time we reach the first card of a claimed chunk, the BOT entry up to that point is up to date, and a forward scan improves the BOT linearly from there, so no work is repeated within the chunk.
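The "fix the BOT for a claimed chunk, then scan it" idea could be sketched with a toy model like the one below (all names, sizes, and the layout are hypothetical, not HotSpot code): a single forward pass over a chunk's cards writes each BOT entry exactly once, advancing through the object starts monotonically, so no backward walk is ever repeated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: a region is a sequence of words, objects placed back to
// back. The BOT stores, per card, how many cards back the object
// covering the card's first word starts. A forward pass over a claimed
// chunk keeps every per-card update O(1) amortized, instead of walking
// backwards from each card independently.
struct RegionModel {
  static const size_t kCardWords = 8;      // words per card (hypothetical)
  std::vector<size_t> obj_starts;          // word index of each object start
  std::vector<size_t> bot;                 // per-card backskip, in cards

  RegionModel(std::vector<size_t> starts, size_t region_words)
      : obj_starts(starts), bot(region_words / kCardWords, 0) {}

  // Fix the BOT for one claimed chunk [first_card, last_card) in a
  // single forward pass; returns the number of entries written.
  size_t update_bot_for_chunk(size_t first_card, size_t last_card) {
    size_t updated = 0;
    size_t obj = 0;
    for (size_t card = first_card; card < last_card; ++card) {
      size_t card_start = card * kCardWords;
      // Advance to the last object starting at or before this card.
      while (obj + 1 < obj_starts.size() && obj_starts[obj + 1] <= card_start)
        ++obj;
      size_t start_card = obj_starts[obj] / kCardWords;
      bot[card] = card - start_card;       // backskip in cards
      ++updated;
    }
    return updated;
  }
};
```

After such a pass, scanning the chunk's cards needs no backward BOT walks at all, which is the property the discussion above is after.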

> 
> > I am also trying to understand a little how serious the problem is in terms of actual performance hit by the duplicate work. It might be fairly high when not doing refinement at all as you indicate.
> 
> I can do a test that compares the scan time with and without the duplicate work removed, and maybe vary the concurrent refinement strength to create different groups. It seems these decisions depend on more data. E.g., if concurrent refinement is on and the de-duplication does not reduce scan time much, then the de-duplication itself might be too heavyweight?

That would be great to quantify this issue (mostly out of curiosity).

> > As for concurrent BOT update, I believe that the existing concurrent refinement code already concurrently updates the BOT as needed afaik so that shouldn't be an issue (I need to go look in detail). One idea could be that after allocating a PLAB (in old gen only of course) we enqueue some background task to fix these. We know the exact locations and sizes of those, maybe this somewhat helps. We also know the chunk sizes that are used for scanning, so we know the problematic places.
> > Something like the "root region scan" for marking (but we do not need to wait for it to complete during gc).
> 
> I misunderstood the concurrent update part. Now I see that you are actually suggesting moving almost the entire BOT fixing work out of the gc pause. I think that might further reduce the pause time. If I understand correctly, concurrent refinement fixes part of the BOT, but there are some leftovers; with your suggestion the leftovers would also move to the concurrent phase. I think it's worth some experiments too!

+1 The next step would then be formatting the "holes" within the live objects with dummy objects while we are already doing that, and then we could drop the second bitmap :) See also https://bugs.openjdk.java.net/browse/JDK-8210708 for the idea. Of course, a completely different issue...
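The deferred-fixing idea discussed above could be sketched roughly as follows (a toy model with hypothetical names, not HotSpot code): during the pause, a PLAB allocation in the old generation only records its bounds in a queue; a concurrent phase later drains the queue and repairs the affected BOT entries, analogous to how root-region scanning defers work for marking.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model of deferring BOT fix-up out of the pause (hypothetical
// names and sizes). Recording a PLAB is O(1) during the pause; the
// per-card BOT repair happens when the queue is drained concurrently.
struct PlabRecord { size_t start_word; size_t size_words; };

struct DeferredBotFixer {
  static const size_t kCardWords = 8;      // words per card (hypothetical)
  std::deque<PlabRecord> pending;          // filled during the pause
  std::vector<size_t> bot;                 // per-card backskip, in cards

  explicit DeferredBotFixer(size_t region_words)
      : bot(region_words / kCardWords, 0) {}

  // Pause-time path: just remember the PLAB bounds, no BOT work.
  void record_plab(size_t start_word, size_t size_words) {
    pending.push_back(PlabRecord{start_word, size_words});
  }

  // Concurrent path: repair the BOT for every recorded PLAB; returns
  // the number of cards fixed.
  size_t drain() {
    size_t fixed_cards = 0;
    while (!pending.empty()) {
      PlabRecord r = pending.front();
      pending.pop_front();
      size_t first = r.start_word / kCardWords;
      size_t last = (r.start_word + r.size_words - 1) / kCardWords;
      for (size_t c = first; c <= last; ++c) {
        bot[c] = c - first;                // backskip to the PLAB start
        ++fixed_cards;
      }
    }
    return fixed_cards;
  }
};
```

In a real implementation the queue would of course need to be drained (or the chunk fixed on demand) before a scan can rely on those BOT entries, which is where the waiting-vs-not-waiting trade-off mentioned above comes in.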

Thanks,
  Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/5039


