RFC: New Serial Full GC

Tue Jan 30 13:52:47 UTC 2024

Hi Roman,

Thanks for sharing this. I have a few comments and some benchmarks.

On 2024-01-25 12:04, Kennke, Roman wrote:
> In particular:
> - Do you think it is feasible at all to do this? Performance seems to be on-par with (or even slightly better than) the current mark-compact algorithm. But it requires more memory: ~1.5% of the heap for the marking bitmap and another ~1.5% for the block-offset-table. Those are maximum values. This might be a concern, given that one of the advantages of Serial GC is that it requires little extra memory. On the other hand: 1. It currently preserves headers on the side. While this usually doesn’t need much space, some workloads churn locks and/or mass-i-hash objects, in which case the preserved-marks-table can become *much* larger than the extra 3%. 2. Shaving-off a whole word per object with Lilliput’s 32-bit headers is going to save more space than the extra structures take away, so this is still going to be a net-plus, especially because the savings are during the whole lifetime of an application, while the extra memory is only needed during (rare) full-GCs.

My main concern would not be the performance of the Full GC but the
additional memory. As you say, this would remove a big part of Serial
GC's main (or only) advantage: the low memory overhead. Even if the
additional memory is only used during the Full GC, I would assume that
the peak overhead is what's important in many cases.

My first thought when reading this was: if we add this overhead to
Serial, is it still worth keeping around? Could we have a
single-threaded mode of G1 that has similar overhead? I did some quick
runs using a small test and a 5G heap, and got these overhead numbers
using NMT:
Baseline Serial: ~14MB
New Serial: ~14MB (Peak: ~174MB)
G1: ~167MB
G1 + ParallelGCThreads=1: ~138MB

This is of course just one very simple data point, and G1 would likely
consume additional memory in a real-world use case (and would also need
to be rewritten using a similar technique). So I'm not proposing to drop 
Serial instead of doing this new full GC implementation, but I think 
it's important to think along those lines. To me a big reason for having 
the Serial collector is use cases where you don't care about GC 
performance but want low overhead.
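
As a rough sanity check on the numbers above, here is a minimal sketch
that computes the upper bound implied by the quoted ~1.5% + ~1.5%
figures for a 5G heap and prints it next to the peak increase seen with
NMT. The class and method names are made up for illustration, and
reading the peak increase as 174 - 14 MB is an assumption about how the
NMT numbers relate to each other.

```java
public class SerialFullGcOverheadEstimate {
    static final long MB = 1024L * 1024L;

    // Upper bound on the extra memory in MB, given the heap size and the
    // two fractions quoted in the RFC (marking bitmap, block-offset-table).
    static double extraMemoryMb(long heapBytes, double bitmapFraction, double botFraction) {
        return heapBytes * (bitmapFraction + botFraction) / MB;
    }

    public static void main(String[] args) {
        long heapBytes = 5L * 1024L * MB;        // 5G heap, as in the quick runs
        double estimateMb = extraMemoryMb(heapBytes, 0.015, 0.015);

        // Assumption: peak increase = New Serial peak minus Baseline Serial.
        int measuredPeakDeltaMb = 174 - 14;

        System.out.printf("estimated max extra memory: ~%.1f MB%n", estimateMb);
        System.out.printf("measured peak increase:     ~%d MB%n", measuredPeakDeltaMb);
    }
}
```

The estimate comes out at roughly 154 MB against a measured increase of
roughly 160 MB, so the measured peak is in the same ballpark as the
stated maximum.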

> - Do you know of any targeted benchmarks that specifically measure full-GC performance? So far I only used this benchmark, which exaggerates the performance impact of the full-GC code, and measured the average time reported for the full-GC phases:

I actually have a few different small SystemGC benchmarks that I used
way back when I did the parallel Full GC for G1. The plan was to
contribute them at some point, but that hasn't happened yet :) I dug
them out and updated them a bit to make sure they run as intended out
of the box. They are very similar to the test you have used but cover a
few more scenarios. They don't test everything, but they give some
peace of mind when it comes to performance. The branch is here:
https://github.com/openjdk/jdk/compare/master...kstefanj:jdk:system-gc-benchmarks

They are based on JMH and use the single-shot mode. Run them either
through the makefiles:
make test TEST=micro:org.openjdk.bench.vm.gc.SystemGC*
Or when you have the benchmarks jar:
java -jar benchmarks.jar SystemGC

The tests generate about 1 GB of objects, so to ensure that no GCs occur
before the actual System.gc(), I usually run the benchmarks with:
-Xmx5g -Xms5g -Xmn3g
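
For readers without the branch checked out, the general shape of such a
measurement can be sketched in plain Java. This is not one of the JMH
benchmarks from the branch, only a simplified stand-in: the class name,
the object size and the default amount of live data are all
assumptions, and a single timed System.gc() call is far less rigorous
than JMH's single-shot mode.

```java
import java.util.ArrayList;
import java.util.List;

public class SystemGcTiming {
    // Allocate roughly totalBytes of live data as many small byte arrays
    // (array headers add a bit on top of the payload size).
    static List<byte[]> allocate(long totalBytes, int objectSize) {
        List<byte[]> live = new ArrayList<>();
        for (long allocated = 0; allocated < totalBytes; allocated += objectSize) {
            live.add(new byte[objectSize]);
        }
        return live;
    }

    // Time one explicit GC request, in milliseconds.
    static long timeSystemGcMillis() {
        long start = System.nanoTime();
        System.gc();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Hypothetical default of 64 MB; pass a byte count to override.
        long totalBytes = args.length > 0 ? Long.parseLong(args[0]) : 64L * 1024 * 1024;
        List<byte[]> live = allocate(totalBytes, 1024);
        long millis = timeSystemGcMillis();
        // Referencing 'live' here keeps the objects reachable across the GC.
        System.out.println("live objects: " + live.size()
                + ", System.gc() took " + millis + " ms");
    }
}
```

To approximate the setup described above with about 1 GB of objects, it
could be launched as:
java -XX:+UseSerialGC -Xmx5g -Xms5g -Xmn3g SystemGcTiming.java 1073741824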

Hope this helps. Though, I don't think performance is what we will 
discuss the most around this change.

Regards,
StefanJ