Reduce CPU usage of Remembered Sets by configuring grow hint of concurrent hash table
Thomas Schatzl
thomas.schatzl at oracle.com
Thu Jun 30 14:52:13 UTC 2022
Hi,
On 30.06.22 16:24, Tianqi Xia wrote:
> Currently G1CardSet uses a concurrent hash table under the hood to keep
> a mapping between heap regions and card set containers. After running
> some experiments, I found adjusting the grow hint of the underlying
> concurrent hash table can potentially bring down the overall CPU usage
> of the process quite a bit.
>
> Here is a brief summary of my test results:
>
> Benchmark: BigRamTester
>
> JDK version: master branch, commit 779b4e1d1959bc15a27492b7e2b951678e39cca8
>
> Testing command: java -XX:+AlwaysPreTouch
> -Xlog:gc*=debug:gc.log::filecount=10,filesize=20m -Xms20g -Xmx20g
> BigRamTester
>
> Without any modification, the average CPU usage (measured by top with
> a 5-minute sampling period) is roughly 660%. After changing the grow
> hint of the card set concurrent hash table from 4 to 1, the CPU usage
> drops to 620%.
>
> The impact of this change on memory usage is minimal. The RES (reported
> by top) before/after the change is about 21.295G vs 21.30G, and the
> native memory usage of G1CardSet (reported by NMT) also shows no
> difference.
>
> Theoretically, by lowering the grow hint we prefer a "flatter" hash
> table: less time is spent traversing the collision list, so less CPU
> is used. What I propose is: shall we make the grow hint a configurable
> parameter? Please let me know if I missed anything.
>
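To make the trade-off in the quoted text concrete, here is a minimal
sketch of a toy chained hash table that doubles its bucket array once
an insert leaves a chain longer than the grow hint. It is not HotSpot's
ConcurrentHashTable: the class GrowHintSketch, its methods, and the
MAX_BUCKETS cap are all made up for illustration, and it assumes the
grow hint acts as a chain-length threshold.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-threaded chained hash table (hypothetical, for illustration only). */
class GrowHintSketch {
    private static final int MAX_BUCKETS = 1 << 14; // assumed cap, keeps the demo small
    private final int growHint;                     // chain length that triggers growth
    private List<List<Integer>> buckets = newBuckets(2);

    GrowHintSketch(int growHint) {
        this.growHint = growHint;
    }

    private static List<List<Integer>> newBuckets(int n) {
        List<List<Integer>> b = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            b.add(new ArrayList<>());
        }
        return b;
    }

    // Bucket count is always a power of two, so masking selects a bucket.
    private static int indexFor(int key, int bucketCount) {
        int h = key * 0x9E3779B9; // simple multiplicative mixing
        h ^= h >>> 16;
        return h & (bucketCount - 1);
    }

    void insert(int key) {
        List<Integer> chain = buckets.get(indexFor(key, buckets.size()));
        chain.add(key);
        // Grow once the chain we just touched is longer than the hint.
        if (chain.size() > growHint && buckets.size() < MAX_BUCKETS) {
            grow();
        }
    }

    private void grow() {
        List<List<Integer>> bigger = newBuckets(buckets.size() * 2);
        for (List<Integer> chain : buckets) {
            for (int key : chain) {
                bigger.get(indexFor(key, bigger.size())).add(key);
            }
        }
        buckets = bigger;
    }

    int bucketCount() {
        return buckets.size();
    }

    int longestChain() {
        int max = 0;
        for (List<Integer> chain : buckets) {
            max = Math.max(max, chain.size());
        }
        return max;
    }

    static GrowHintSketch filled(int growHint, int keys) {
        GrowHintSketch table = new GrowHintSketch(growHint);
        for (int k = 0; k < keys; k++) {
            table.insert(k);
        }
        return table;
    }

    public static void main(String[] args) {
        for (int hint : new int[] {4, 1}) {
            GrowHintSketch t = filled(hint, 2000);
            System.out.printf("hint=%d buckets=%d longest chain=%d%n",
                    hint, t.bucketCount(), t.longestChain());
        }
    }
}
```

For the same keys, the table with the smaller hint never ends up with
fewer buckets or a longer worst-case chain than the one with the larger
hint. That is the CPU versus table size trade-off in miniature; it says
nothing about the actual numbers in G1.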
We had something like this in the initial implementation but did not
have time to look into it further, so we removed that user-facing
interface.
For example, the constructor of G1CardSetHashTable has a parameter to
set its initial size, but we never use it.
The same goes for the typical maximum number of links in a chain.
Something similar applies to G1SegmentedArrayAllocOptions always being
the same (or even to having different G1CardSetConfigurations).
Did you test with different initial settings for different types of
regions? BigRamTester in particular tends to have a fairly unique
distribution depending on the generation, iirc.
Ideally these values would somehow be automatically derived, and not set
by an option.
So in general, yes, we are interested in improvements there.
As a first step it would also be useful to e.g. print gc/refinement
threads' vtime (I think this would mostly reduce refinement thread
vtimes?) in a more straightforward/useful manner (and just ignore the
OSes where this is not easily possible).
I'm thinking about some (periodic?) log output that just prints
(refinement) thread vtimes one after another (maybe relative?), maybe
with an initial header, but I haven't thought it through.
There is already some (badly formatted) output for the marking threads, iirc.
At least I think it would generally be fairly useful for the VM to
provide such information, as hooking up some third-party tool is
sometimes a bit cumbersome.
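Until something like that exists in the VM, per-thread CPU time can be
approximated from outside. Here is a minimal, Linux-only sketch that
reads /proc/self/task for the current process and prints utime + stime
for every thread under a header line. This is not the proposed VM log
output: the class ThreadVtimeSketch and its helper cpuTicks are
hypothetical, and the values are in clock ticks rather than seconds.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Prints per-thread CPU ticks of the current process (hypothetical helper, Linux only). */
class ThreadVtimeSketch {
    /** Returns utime + stime (in clock ticks) from one /proc/<pid>/task/<tid>/stat line. */
    static long cpuTicks(String statLine) {
        // The thread name (field 2) may contain spaces or parentheses, so
        // only split the part after the last ')'. That part starts at field 3.
        String[] f = statLine.substring(statLine.lastIndexOf(')') + 2).split(" ");
        return Long.parseLong(f[11]) + Long.parseLong(f[12]); // fields 14 and 15
    }

    public static void main(String[] args) throws IOException {
        Path tasks = Paths.get("/proc/self/task");
        if (!Files.isDirectory(tasks)) {
            // Mirrors the idea of ignoring OSes where this is not easily possible.
            System.out.println("per-thread /proc data not available on this OS");
            return;
        }
        System.out.printf("%-16s %12s%n", "thread", "cpu ticks");
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(tasks)) {
            for (Path task : dirs) {
                try {
                    String name = Files.readString(task.resolve("comm")).trim();
                    long ticks = cpuTicks(Files.readString(task.resolve("stat")));
                    System.out.printf("%-16s %12d%n", name, ticks);
                } catch (IOException e) {
                    // The thread exited between listing and reading; skip it.
                }
            }
        }
    }
}
```

Running this inside the JVM under test (or adapting the path to
/proc/<pid>/task for another process) and diffing two samples gives a
relative per-thread figure, which is roughly what a periodic log line
would show.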
Of course, all this is just my initial reaction, I'm hoping others chime
in too :)
Thanks,
Thomas