RFC: New Serial Full GC

Thu Jan 25 11:04:04 UTC 2024

Hello GC engineers,

(This is my 3rd attempt to send this email, this time from my personal account. Apologies if the message is delivered more than once, in case the first two make it through, eventually.)

As part of the Lilliput project I am currently implementing a new full-GC for the Serial GC. The background of this is that the current implementation, a mark-compact algorithm, overrides the header of objects while compacting objects, because it needs to store the forwarding pointers into the header. So far this hasn’t been a big deal, because in most cases the header is not very interesting. It is only interesting if an object is locked or has an identity hash-code, which is usually very few. In those cases, the header is stored in a side-structure and restored after GC. However, with Lilliput, the narrow Klass-pointer lives in the object header, which means that all headers are interesting, and in-fact critical, even during GC.

A while ago I’ve come up with a partial solution which is good enough for 64-bit object headers, which works by encoding the forwarding-pointer such that it fits into the lower part of the header, while leaving the upper part (where the nKlass lives) alone. That implementation is here: https://github.com/openjdk/jdk/pull/13582

However, going forward, this is not a really sustainable solution, either. In Lilliput we eventually want to implement 32-bit object headers (and, as it turns out, we are already pretty close to that). In such a scenario, there would simply be no space for both an nKlass *and* a forwarding pointer in the header. So we need something that doesn’t require to store a forwarding pointer in the header to begin with.

An obvious approach would be to store forwarding information in a side-table (a simple hash-table would do that). I implemented a prototype, but it became clear pretty soon that 1. performance could never compete with the current solution and 2. memory requirements are not practical, either.

Therefore I’ve been looking around in the literature and found an approach that the GC handbook calls ‘compressor’ so I used that name, too :-)

The basic idea is to compute the forwarding location for each object on the fly by using a marking bitmap and a (sort-of) block-offset-table. For full description of the algorithm, see the file serialCompressor.hpp in draft PR:

https://github.com/openjdk/jdk/pull/17312

(For those who know, this is pretty much the algorithm that is already used by the Parallel GC, modulo the stuff to make it work with multiple worker threads.)

The prototype is fairly complete and almost ready for review, but I wanted to get opinions from the experts, first, before I spend any more time on it.

In particular:
- Do you think it is feasible at all to do this? Performance seems to be on-par with (or even slightly better than) the current mark-compact algorithm. But it requires more memory: ~1.5% of the heap for the marking bitmap and another ~1.5% for the block-offset-table. Those are maximum values. This might be a concern, given that one of the advantages of Serial GC is that it requires little extra memory. On the other hand: 1. It currently preserves headers on the side. While this usually doesn’t need much space, some workloads churn locks and/or mass-i-hash objects, in which case the preserved-marks-table can become *much* larger than the extra 3%. 2. Shaving-off a whole word per object with Lilliput’s 32-bit headers is going to save more space than the extra structures take away, so this is still going to be a net-plus, especially because the savings are during the whole lifetime of an application, while the extra memory is only needed during (rare) full-GCs.
- Do you know of any targeted benchmarks that specifically measure full-GC performance? So far I only used this benchmark, which exxagerates the performance impact of the full-GC code, and measure the average time reported for the full-GC phases:

https://github.com/openjdk/jdk/pull/13582#issuecomment-1533133187

Performance is pretty much like existing implementation:
- mark-compact: 368ms
- compressor: 367ms

For completeness and comparison:

- G1 GC: 457.44ms single-threaded (154.61ms multi-threaded)
- Parallel GC: 883.18ms single-threaded (432.26ms multi-threaded) (yes, that is concerning, see https://bugs.openjdk.org/browse/JDK-8320165)
- Shenandoah: 710.96ms single-threaded (85.50ms multi-threaded)

ZGC doesn’t use any full-GC.

- Any other concerns that might block this effort or make it unfeasible?

Going forward, and if you agree with the approach, I would:
- Clean up the PR and open it for review.
- Implement the similar algorithm for Shenandoah and G1. The algorithm itself should be pretty straightforward to make multi-threaded, and with Shenandoah and G1, doesn’t require as much extra memory, because they already use marking-bitmaps anyway. We may have to throw the serial full GC part of G1 under the bus - I currently don’t see how to encode ‘random’ forwarding. If it’s important, we may use slowish hash-table for serial-full-GC-forwardings - at that point, performance is out of the window anyway, so probably doesn’t matter.

Let me know what you think!

Cheers,
Roman

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879