RFR: 8272083: G1: Record iterated range for BOT performance during card scan [v3]

Yude Lin github.com+16811675+linade at openjdk.java.net
Fri Oct 8 10:20:08 UTC 2021


On Fri, 1 Oct 2021 11:52:50 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

> G1 already uses so many threads, so adding more does not seem to be a good idea. Also, just one thread is going to be overwhelmed on large heaps, probably making the method less effective there where it is more necessary than in other cases. Maybe just fake cards to scan in the DCQS so that this work is done first (and always) by the refinement threads? Some tweaking of thread numbers and refinement threads is likely needed.

This is a good point. I'll see if I can create some kind of fake refinement task. After all, real refinement tasks and these fake tasks would do similar things, like calling block_start(), so it should be possible.
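A toy standalone sketch of the idea (hypothetical names, not the real DCQS API): pre-seed the shared queue with synthetic entries so refinement-style workers drain the BOT-fixup work first, before any real dirty cards.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical work item: either a real dirty card or a synthetic
// "fake card" that only exists to trigger BOT fixup.
struct Card {
  std::size_t index;
  bool fake;
};

// Simplified stand-in for a dirty card queue (no locking shown).
struct Queue {
  std::deque<Card> q;
  // Fake cards go to the front so they are processed first (and always).
  void seed_fake(std::size_t idx)    { q.push_front({idx, true}); }
  void enqueue_real(std::size_t idx) { q.push_back({idx, false}); }
  bool pop(Card& out) {
    if (q.empty()) return false;
    out = q.front();
    q.pop_front();
    return true;
  }
};
```

The point of the sketch is only the ordering: synthetic entries ride the existing refinement machinery instead of needing dedicated threads.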

> Not sure about whether the complexity for using the bitmap level as storage is worth the effort: in my testing I have never even come close to 512 PLABs per region. In that case (or even earlier), probably just bail out, drop the whole task and do nothing as with that many PLABs the amount of overlap during gc is likely to be small. I need to do some more testing and thinking about this though.

Usually I observe small PLABs (~2k in size) at the beginning of my SPECjbb runs, perhaps because the heuristics have not yet adjusted the PLAB size. My only concern is that, if this is the case, we will find ourselves allocating a lot of large C-heap arrays, which might be bad for pause times. But I think bailing out is also a good choice, considering it simplifies the card set design a lot.

> the G1BOTFixingCardSet in HeapRegion should at most be a pointer within HeapRegion: since only a small percentage of regions are ever affected by this, it seems a waste to always allocate memory for them, even if only little.

Yes. Will change that.

> Actually I have seen only mid single digit number of plabs per region whatever I have been running; so I even kind of think it might be useful to decrease the maximum PLAB size to have more of those so that more threads can work on these and the individual BOT fixup is faster (to abort faster). I have no particular guidance here at this time of how large is too large; but something like half or a third of a region for 32m regions is quite a bit to chew on :) This of course affects the storage needs, but this limit should always be so that we would never want to use the bitmap.

I imagine imposing a maximum PLAB size would affect evacuation efficiency, right? I'm not sure how this weighs in.

> some potential renames to be done only when we are done evaluating this: rename this feature to G1ConcurrentBOTUpdate, not "fixing" :)

Will change the name.

> There is another option for storing whether this part of the BOT is unrefined yet: take a bit from the BOT values themselves to encode that.
> I did not look whether this is actually possible with the current encoding, but is an option if that does not take away too much of the range of the backskip - at the moment we use the values 0-63 (*8 = 512) to encode the offsets within the card, all higher values are backskip values after all.
> I.e. it might even be that extremely high backskip values which would not be used in all but huge arrays (if at all) are available for such a thing.
> Just some weird idea that came to my mind...

I think the backskip value is log_16 of the number of cards to skip, so I don't think any array is big enough to require a large backskip value. A backskip value of 4 with 512-byte cards already means a 32m skip, bigger than the biggest non-humongous object. So maybe the highest bit is always available.
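The arithmetic behind that claim can be checked with a small sketch (hypothetical constants mirroring the discussion, not HotSpot's actual BOTConstants): with 512-byte cards and a base-16 backskip encoding, a backskip value of 4 already covers 32 MB.

```cpp
#include <cstddef>

// Hypothetical constants for illustration (not HotSpot's BOTConstants):
// 512-byte cards and a logarithmic backskip with base 2^LogBase = 16.
const std::size_t CardSizeBytes = 512;
const unsigned    LogBase       = 4;

// Number of cards a backskip value of `n` jumps over: 16^n.
std::size_t cards_back(unsigned n) {
  return (std::size_t)1 << (LogBase * n);
}
```

So `cards_back(4)` is 65536 cards, i.e. 65536 * 512 bytes = 32 MB, which is why very high backskip values are plausibly unused and could be repurposed.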

Nonetheless, this means we would change a lot of BOT code. It becomes more complicated and slightly slower: for example, when doing atomic updates to an entry, originally we do ``Atomic::store()``, and now we would need a looped ``Atomic::cmpxchg()`` to preserve the special bit. Can we afford that?
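A minimal standalone sketch of what that looped update would look like, using std::atomic in place of HotSpot's Atomic wrapper, and assuming (purely for illustration) that the top bit of a one-byte BOT entry is the reserved "unrefined" flag:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical reserved flag bit in a BOT entry (assumption for this
// sketch; the real encoding is what the thread is debating).
const uint8_t UnrefinedBit = 0x80;

// A plain store would clobber the flag, so the update becomes a CAS
// loop that carries the current flag state over into the new value.
void set_entry_preserving_flag(std::atomic<uint8_t>& entry, uint8_t value) {
  uint8_t old = entry.load(std::memory_order_relaxed);
  uint8_t desired;
  do {
    // Keep the reserved bit as-is, replace only the payload bits.
    desired = (uint8_t)((old & UnrefinedBit) | (value & ~UnrefinedBit));
  } while (!entry.compare_exchange_weak(old, desired,
                                        std::memory_order_relaxed));
}
```

This is exactly the extra cost being questioned: one load plus a retry loop where a single store used to suffice.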

> Another minor optimization is that if we used the refinement information in some way, only update the BOT for areas that actually may ever be scanned - i.e. anything that is may be put into a remembered set.
> 
> I.e. we might be promoting a lot of data that never needs to be scanned (because e.g. they reference almost only data within a region).
> 
> Admittedly this is all just slinging ideas around and see if something sticks.

I haven't looked at how this information is maintained and used before; I need to learn more about it. Thanks for another pointer :)

-------------

PR: https://git.openjdk.java.net/jdk/pull/5039
