Reduce CPU usage of Remembered Sets by configuring grow hint of concurrent hash table

Thu Jun 30 14:52:13 UTC 2022

Hi,

On 30.06.22 16:24, Tianqi Xia wrote:
> Currently G1CardSet uses a concurrent hash table under the hood to keep 
> a mapping between heap regions and card set containers. After running 
> some experiments, I found adjusting the grow hint of the underlying 
> concurrent hash table can potentially bring down the overall CPU usage 
> of the process quite a bit.
> 
> Here is a brief summary of my testing result:
> 
> Benchmark: BigRamTester
> 
> JDK version: master branch, commit 779b4e1d1959bc15a27492b7e2b951678e39cca8
> 
> Testing command: java -XX:+AlwaysPreTouch 
> -Xlog:gc*=debug:gc.log::filecount=10,filesize=20m -Xms20g -Xmx20g 
> BigRamTester
> 
> Without any modification, the average CPU usage (measured by top over a
> 5-minute sampling period) is roughly 660%. After changing the grow hint
> of the card set concurrent hash table from 4 to 1, the CPU usage drops
> to 620%.
> 
> The impact of this change on memory usage is minimal. The RES (reported
> by top) before/after the change is about 21.295G vs 21.30G, and the
> native memory usage of G1CardSet (reported by NMT) also shows no
> difference.
> 
> Theoretically, by lowering the grow hint we prefer a "flatter" hash
> table: less time is spent traversing the collision list, so less CPU
> is used. My proposal: should we make the grow hint a configurable
> parameter? Please let me know if I missed anything.
> 

We had something like this in the initial implementation, but did not
have time to look into it further, so we removed that interface to the
user.

E.g. see the constructor of G1CardSetHashTable, which has a parameter to
set its initial size, but we never use it.

The same goes for the maximum typical number of links in the chain.
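
To make the chain-length trade-off concrete, here is a toy sketch in
Java. It is not the HotSpot ConcurrentHashTable: the class name, the
hash mixing, the doubling policy and the MAX_BUCKETS cap are all
assumptions made for illustration. It only shows that a lower grow hint
makes a chained table grow earlier, which shortens the chains a lookup
has to walk at the cost of more buckets.

```java
public class GrowHintDemo {
    // Hypothetical cap on the table size so the toy table cannot grow without bound.
    static final int MAX_BUCKETS = 1 << 16;

    // Illustrative hash mixing; bucket counts are always powers of two.
    static int bucketOf(int key, int buckets) {
        int h = key * 0x9E3779B9;
        h ^= h >>> 16;
        return h & (buckets - 1);
    }

    /**
     * Inserts keys 0..entries-1 and doubles the table whenever the chain just
     * inserted into becomes longer than growHint.
     * Returns { final bucket count, total links walked to look up every key once }.
     */
    static long[] fill(int growHint, int entries) {
        int buckets = 16;
        int[] chainLen = new int[buckets]; // only chain lengths matter for this sketch
        for (int key = 0; key < entries; key++) {
            int b = bucketOf(key, buckets);
            chainLen[b]++;
            if (chainLen[b] > growHint && buckets < MAX_BUCKETS) {
                buckets *= 2;
                chainLen = new int[buckets];
                // Rehash everything inserted so far into the larger table.
                for (int k = 0; k <= key; k++) {
                    chainLen[bucketOf(k, buckets)]++;
                }
            }
        }
        long links = 0;
        for (int len : chainLen) {
            // Finding the i-th entry of a chain walks i links: 1 + 2 + ... + len.
            links += (long) len * (len + 1) / 2;
        }
        return new long[] { buckets, links };
    }

    public static void main(String[] args) {
        int entries = 2000;
        for (int hint : new int[] { 4, 1 }) {
            long[] r = fill(hint, entries);
            System.out.printf("grow hint %d: %d buckets, %.2f links per lookup%n",
                              hint, r[0], (double) r[1] / entries);
        }
    }
}
```

Running main prints the bucket count and average links per lookup for
hints 4 and 1; the exact numbers depend on the toy hash function and
say nothing about G1 itself.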

Something similar applies to G1SegmentedArrayAllocOptions always being
the same (or even to having different G1CardSetConfigurations).

Did you test with different initial settings for different types of
regions? BigRamTester in particular tends to have a fairly unique
distribution depending on generation, iirc.

Ideally these values would somehow be automatically derived, and not set 
by an option.

So in general, yes, we are interested in improvements there.

As a first step it would also be useful to e.g. print gc/refinement
threads' vtime (I think your change would mostly reduce refinement
thread vtimes?) in a more straightforward/useful manner (and just ignore
the OSes where this is not easily possible).

I'm thinking about some (periodic?) log output that just prints 
(refinement) thread vtimes one after another (maybe relative?), maybe 
with an initial header, but I haven't thought it through.
There is already some (badly formatted) output for the marking threads,
iirc.
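
As a rough illustration of the output shape only (a header followed by
one line per thread), here is a sketch using the standard ThreadMXBean
API. Note the assumptions: this is Java-level code, it only sees Java
threads and not G1 refinement or marking threads, and the class name
and column layout are made up for the example. The log output discussed
above would have to be produced inside the VM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuTimes {
    /** Builds a header line followed by one "name, CPU time" line per live Java thread. */
    static String report() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder(String.format("%-32s %12s%n", "thread", "cpu (ms)"));
        // Skip platforms/configurations where per-thread CPU time is unavailable.
        if (!bean.isThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            return sb.append("(per-thread CPU time not available)").append(System.lineSeparator()).toString();
        }
        for (long id : bean.getAllThreadIds()) {
            ThreadInfo info = bean.getThreadInfo(id);
            long nanos = bean.getThreadCpuTime(id); // -1 if the thread has already died
            if (info == null || nanos < 0) {
                continue;
            }
            sb.append(String.format("%-32s %12.3f%n", info.getThreadName(), nanos / 1e6));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Calling report() periodically and subtracting the previous values would
give the relative variant.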

At least I think it would generally be fairly useful for the VM to
provide such information, as hooking up a third-party tool is sometimes
a bit cumbersome.

Of course, all this is just my initial reaction, I'm hoping others chime 
in too :)

Thanks,
   Thomas