RFC: New Serial Full GC
Stefan Johansson
stefan.johansson at oracle.com
Tue Jan 30 13:52:47 UTC 2024
Hi Roman,
Thanks for sharing this. I have a few comments and some benchmarks.
On 2024-01-25 12:04, Kennke, Roman wrote:
> In particular:
> - Do you think it is feasible at all to do this? Performance seems to be on-par with (or even slightly better than) the current mark-compact algorithm. But it requires more memory: ~1.5% of the heap for the marking bitmap and another ~1.5% for the block-offset-table. Those are maximum values. This might be a concern, given that one of the advantages of Serial GC is that it requires little extra memory. On the other hand: 1. It currently preserves headers on the side. While this usually doesn’t need much space, some workloads churn locks and/or mass-i-hash objects, in which case the preserved-marks-table can become *much* larger than the extra 3%. 2. Shaving-off a whole word per object with Lilliput’s 32-bit headers is going to save more space than the extra structures take away, so this is still going to be a net-plus, especially because the savings are during the whole lifetime of an application, while the extra memory is only needed during (rare) full-GCs.
My main concern would not be the performance of the Full GC but the
additional memory. As you say, this would remove a big part of Serial
GC's main/only advantage, the low memory overhead. Even if the additional
memory is only used during the Full GC I would assume that the peak
overhead is what's important in many cases.
My first thought when reading this was: if we add this overhead to
Serial, is it still worth keeping it around? Could we have a
single-threaded mode of G1 that has similar overhead? I did some quick
runs using a small test and a 5G heap. I got these overhead numbers
using NMT:
Baseline Serial: ~14MB
New Serial: ~14MB (Peak: ~174MB)
G1: ~167MB
G1 + ParallelGCThreads=1: ~138MB
This is of course just one very simple data point and G1 would likely
consume additional memory in a real world use case (and would also need
to be rewritten using a similar technique). So I'm not proposing to drop
Serial instead of doing this new full GC implementation, but I think
it's important to think along those lines. To me a big reason for having
the Serial collector is use cases where you don't care about GC
performance but want low overhead.
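As a rough cross-check of the numbers above: the gap between the new
Serial peak and its baseline is about 160MB, which on a 5G heap is close
to the ~3% estimated in the quoted text (~1.5% for the marking bitmap
plus ~1.5% for the block-offset-table). The Java sketch below only
redoes that arithmetic. The class and method names are hypothetical, the
heap is assumed to be 5 * 1024 MB, and the 1/64 ratio per side table is
an assumption picked because it matches ~1.5%, not something taken from
the patch.

```java
public class SerialOverheadSketch {
    // Extra native memory at peak, as a percentage of the Java heap.
    static double peakExtraPercent(double baselineMb, double peakMb, double heapMb) {
        return (peakMb - baselineMb) / heapMb * 100.0;
    }

    // Size of one side table under the ASSUMED ratio of 1/64 of the heap
    // (about 1.56%, in line with the "~1.5%" quoted above).
    static double sideTableMb(double heapMb) {
        return heapMb / 64.0;
    }

    public static void main(String[] args) {
        double heapMb = 5 * 1024;      // 5G heap, assumed to mean 5 GiB
        double baselineMb = 14;        // "Baseline Serial" from the NMT numbers
        double peakMb = 174;           // "New Serial" peak from the NMT numbers

        System.out.println("Measured extra at peak (MB): " + (peakMb - baselineMb));
        System.out.println("Measured extra at peak (% of heap): "
                + peakExtraPercent(baselineMb, peakMb, heapMb));
        // Two side tables: marking bitmap and block-offset-table.
        System.out.println("Estimated extra from two side tables (MB): "
                + 2 * sideTableMb(heapMb));
    }
}
```

Under these assumptions the measured and estimated extra memory both
come out at about 160MB, or roughly 3.1% of the heap, which puts the new
Serial peak in the same range as the G1 numbers above.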
> - Do you know of any targeted benchmarks that specifically measure full-GC performance? So far I only used this benchmark, which exaggerates the performance impact of the full-GC code, and measure the average time reported for the full-GC phases:
I actually have a few different small SystemGC benchmarks that I used
way back when I did the parallel Full GC for G1. The plan was to
contribute them at some point, but that hasn't happened yet :) I dug them
out and updated them a bit to make sure they run as intended out of the
box. They are very similar to the test you have used but cover a few more
scenarios. Not at all testing everything, but something to give some
peace of mind when it comes to performance. The branch is here:
https://github.com/openjdk/jdk/compare/master...kstefanj:jdk:system-gc-benchmarks
They are based on JMH and use the single-shot mode. Run them either
through the make files:
make test TEST=micro:org.openjdk.bench.vm.gc.SystemGC*
Or when you have the benchmarks jar:
java -jar benchmarks.jar SystemGC
The tests generate about 1 GB of objects, so to ensure that no GCs occur
before the actual System.gc() I usually run the benchmarks with:
-Xmx5g -Xms5g -Xmn3g
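For readers without the branch at hand, the general shape of such a
measurement can be sketched in plain Java: build up live data, then time
a single System.gc() call. This is only an illustration and not the code
from the branch; it does not use JMH, the class and method names are
hypothetical, and the object counts are arbitrary and much smaller than
the ~1 GB mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

public class SystemGcTimingSketch {
    // Keeps the generated objects reachable so the full GC has live data to process.
    static List<byte[]> live;

    // Allocates objectCount arrays of objectSize bytes, then times one System.gc() call.
    static long timeSystemGcNanos(int objectCount, int objectSize) {
        live = new ArrayList<>(objectCount);
        for (int i = 0; i < objectCount; i++) {
            live.add(new byte[objectSize]);
        }
        long start = System.nanoTime();
        System.gc(); // only a request; it is ignored with -XX:+DisableExplicitGC
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Roughly 64 MB of payload: 65536 arrays of 1 KB each (arbitrary sizes).
        long nanos = timeSystemGcNanos(1 << 16, 1024);
        System.out.println("System.gc() took " + (nanos / 1_000_000)
                + " ms with " + live.size() + " live objects");
    }
}
```

A single-shot measurement like this is noisy, so it is worth running it
several times in fresh JVMs and, as with the benchmarks above, sizing the
heap and young generation so that no GC happens before the timed call.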
Hope this helps. Though, I don't think performance is what we will
discuss the most around this change.
Regards,
StefanJ
More information about the hotspot-gc-dev
mailing list