RFC: New Serial Full GC

Tue Jan 30 17:28:52 UTC 2024

Hi Stefan,

Some replies inline:

>> In particular:
>> - Do you think it is feasible at all to do this? Performance seems to be on-par with (or even slightly better than) the current mark-compact algorithm. But it requires more memory: ~1.5% of the heap for the marking bitmap and another ~1.5% for the block-offset-table. Those are maximum values. This might be a concern, given that one of the advantages of Serial GC is that it requires little extra memory. On the other hand: 1. It currently preserves headers on the side. While this usually doesn’t need much space, some workloads churn locks and/or mass-i-hash objects, in which case the preserved-marks-table can become *much* larger than the extra 3%. 2. Shaving-off a whole word per object with Lilliput’s 32-bit headers is going to save more space than the extra structures take away, so this is still going to be a net-plus, especially because the savings are during the whole lifetime of an application, while the extra memory is only needed during (rare) full-GCs.
> 
> My main concern would not be the performance of the Full GC but the
> additional memory. As you say, this would remove a big part of Serial
> GC's main/only advantage, the low memory overhead. Even if the additional
> memory is only used during the Full GC I would assume that the peak
> overhead is what's important in many cases.

Right, that is a valid concern. I asked around a little bit here (team and customers) and the overall feeling was that a predictable, small-ish x% maximum overhead is perhaps better than an unpredictable overhead that occurs when a workload mass-i-hashes objects and/or churns locks. When that happens, the preserved-marks-table can become quite large. If you have to model for that unknown bad case, you have to take a much higher peak into account.

> My first thought when reading this was, if we add this overhead to
> Serial is it still worth keeping it around? Could we have a single
> threaded mode of G1 that has similar overhead? Did some quick runs using
> a small test and a 5G heap. I got these overhead numbers using NMT:
> Baseline Serial: ~14MB
> New Serial: ~14MB (Peak: ~174MB)
> G1: ~167MB
> G1 + ParallelGCThreads=1: ~138MB

Those numbers look like what I would expect, yes. Can you share the program? Or try the other extreme and i-hash all the objects that you allocate?
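
To make the "other extreme" concrete, something along these lines is what I have in mind. This is only a sketch: the class name, method name and default object count are made up here, and it relies on HotSpot storing the identity hash in the object header.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stress test (names and sizes are illustrative only):
// every allocated object gets an identity hash and stays live, so a
// header-overwriting full GC has to preserve a mark for each of them.
final class IHashAll {
    // Allocates 'count' small objects, identity-hashes each one and
    // keeps all of them reachable through the returned list.
    static List<Object> allocateHashed(int count) {
        List<Object> live = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            Object o = new Object();
            System.identityHashCode(o); // forces an i-hash for this object
            live.add(o);
        }
        return live;
    }

    public static void main(String[] args) {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10_000_000;
        List<Object> live = allocateHashed(count);
        System.gc(); // request a full GC while all hashed objects are live
        System.out.println("live objects: " + live.size());
    }
}
```

Running it with -XX:+UseSerialGC and -XX:NativeMemoryTracking=summary, and the same heap settings as your test, should show how far the baseline overhead moves in that case.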

> This is of course just one very simple data point and G1 would likely
> consume additional memory in a real world use case (and would also need
> to be rewritten using a similar technique). So I'm not proposing to drop
> Serial instead of doing this new full GC implementation, but I think
> it's important to think along those lines. To me a big reason for having
> the Serial collector is use cases where you don't care about GC
> performance but want low overhead.

Correct. And/or only have 1 or very few cores (e.g. small containers). Consider that we’re doing this to support compact object headers, which typically save more memory than the extra overhead taken by this Serial Full GC implementation.

Also for context: going to 8-byte headers does not strictly require the new full-GC, as long as we can squeeze the forwarding pointers into the lower 32 bits of the header. I provided one way to do that here: https://github.com/openjdk/jdk/pull/13582. Another alternative would be to simply use compressed oops, as long as the heap is small enough. However, if/when we want to go to 4-byte headers (which I certainly want to do soon-ish), then we don’t have space for any forwarding pointer in the header and need an alternative. I studied what Parallel GC does and read around in the literature, and what I proposed here looked to me like the best compromise.

I’ve also got some ideas for how to reduce the memory usage of the algorithm:
- It looks to me like when full-GC is not triggered by promotion failure, it does not collect (young) from-space. If from-space has enough room, we could try to place the BOT and/or the marking bitmap there. I don’t know how common that is, I haven’t even dug into it enough to know under which scenario this happens and if from-space would typically have space left.
- Marking bitmap and BOT could be allocated/mmap-ed only for the relevant parts of the heap (e.g. the memory that is actually used/collected). Not sure how much that buys, but in the case above where from-space is not collected, we don’t need to allocate a BOT for it.
- The BOT could be made coarser. I only chose 64-word blocks because 1. Literature said it’s a reasonable number and 2. Counting the bits in a block causes at most a single load from the marking bitmap. Increasing block-size to 128 would cut the BOT in half.
- I think the marking bitmap could only be made smaller if object alignment is increased, but I’ll need to think about this a little.
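
To put rough numbers on the BOT block-size trade-off above, here is a back-of-the-envelope sketch. It assumes 8-byte heap words, one mark bit per heap word, and one word-sized BOT entry per block; the class and method names are made up for illustration and are not taken from the patch.

```java
// Back-of-the-envelope estimate of the full-GC side structures.
// Assumptions (not from the patch): 8-byte heap words, a marking bitmap
// with one bit per heap word, and a BOT with one 8-byte entry per block.
final class FullGcOverhead {
    static final long WORD_BYTES = 8;

    // One mark bit per heap word, eight bits per bitmap byte.
    static long bitmapBytes(long heapBytes) {
        return heapBytes / WORD_BYTES / 8;
    }

    // One word-sized entry per block of 'blockWords' heap words.
    static long botBytes(long heapBytes, long blockWords) {
        return heapBytes / WORD_BYTES / blockWords * WORD_BYTES;
    }

    // Number of 64-bit bitmap words that hold the mark bits of one block.
    static long bitmapLoadsPerBlock(long blockWords) {
        return (blockWords + 63) / 64;
    }

    public static void main(String[] args) {
        long heap = 5L << 30; // 5 GiB heap
        long mib = 1L << 20;
        System.out.println("bitmap MiB:        " + bitmapBytes(heap) / mib);
        System.out.println("BOT MiB (64-word):  " + botBytes(heap, 64) / mib);
        System.out.println("BOT MiB (128-word): " + botBytes(heap, 128) / mib);
        System.out.println("loads/block (64):   " + bitmapLoadsPerBlock(64));
        System.out.println("loads/block (128):  " + bitmapLoadsPerBlock(128));
    }
}
```

Under these assumptions a 5 GiB heap needs 80 MiB for the bitmap and 80 MiB for a 64-word BOT, which is roughly in line with the gap between the ~14MB baseline and the ~174MB peak reported above. Going to 128-word blocks drops the BOT to 40 MiB, but the mark bits of one block then span two 64-bit bitmap words instead of one.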

>> - Do you know of any targeted benchmarks that specifically measure full-GC performance? So far I only used this benchmark, which exaggerates the performance impact of the full-GC code, and measured the average time reported for the full-GC phases:
> 
> I actually have a few different small SystemGC benchmarks that I used
> way back when I did the parallel Full GC for G1. The plan was to
> contribute them at some point but that hasn't happened yet :) I dug them out
> and updated them a bit to make sure they run as intended out of the box.
> They are very similar to the test you have used but a few more
> scenarios. Not at all testing everything, but something to give some
> peace of mind when it comes to performance. The branch is here:
> https://github.com/openjdk/jdk/compare/master...kstefanj:jdk:system-gc-benchmarks
> 
> They are based on JMH and use the single shot mode. Run them either
> through the make files:
> make test TEST=micro:org.openjdk.bench.vm.gc.SystemGC*
> Or when you have the benchmarks jar:
> java -jar benchmarks.jar SystemGC
> 
> The tests generate about 1 GB of objects, so to ensure that no GCs occur
> before the actual System.gc(), I usually run the benchmarks with:
> -Xmx5g -Xms5g -Xmn3g
> 
> Hope this helps. Though, I don't think performance is what we will
> discuss the most around this change.

Thanks, I will try them. This looks quite useful!

Thanks for your feedback!

Cheers,
Roman
