RFC: New Serial Full GC
Stefan Johansson
stefan.johansson at oracle.com
Wed Jan 31 13:31:25 UTC 2024
Hi Roman,
On 2024-01-30 18:28, Kennke, Roman wrote:
> Hi Stefan,
>
> Some replies inline:
>
>>> In particular:
>>> - Do you think it is feasible at all to do this? Performance seems to be on-par with (or even slightly better than) the current mark-compact algorithm. But it requires more memory: ~1.5% of the heap for the marking bitmap and another ~1.5% for the block-offset-table. Those are maximum values. This might be a concern, given that one of the advantages of Serial GC is that it requires little extra memory. On the other hand: 1. It currently preserves headers on the side. While this usually doesn’t need much space, some workloads churn locks and/or mass-i-hash objects, in which case the preserved-marks-table can become *much* larger than the extra 3%. 2. Shaving-off a whole word per object with Lilliput’s 32-bit headers is going to save more space than the extra structures take away, so this is still going to be a net-plus, especially because the savings are during the whole lifetime of an application, while the extra memory is only needed during (rare) full-GCs.
>>
>> My main concern would not be the performance of the Full GC but the
>> additional memory. As you say, this would remove a big part of Serial
>> GCs main/only advantage, the low memory overhead. Even if the additional
>> memory is only used during the Full GC I would assume that the peak
>> overhead is what's important in many cases.
>
> Right, that is a valid concern. I asked around a little bit here (team and customers) and the overall feeling was that a predictable, small-ish x% max overhead is perhaps better than an unpredictable overhead that occurs when a workload mass-i-hashes objects and/or churns locks. When that happens, the preserved-marks-table can become quite large. If you have to model for that unknown bad case, you have to take a much higher peak into account.
>
I agree, I would also prefer predictability, and I'm not really against
this change. It's just hard to know how many users (if any) will see a
problematic regression due to the memory increase.
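To put rough numbers on the extra structures, here is a small sketch (my own back-of-the-envelope calculation, assuming one mark bit per 8-byte heap word and an 8-byte BOT entry per 64-word block as described in the RFC; the prototype's exact entry sizes may differ):

```java
public class FullGCOverhead {
    // Marking bitmap: one bit per 8-byte heap word.
    static long bitmapBytes(long heapBytes) {
        long heapWords = heapBytes / 8;
        return heapWords / 8;
    }

    // Block-offset-table: one assumed 8-byte entry per 64-word (512-byte) block.
    static long botBytes(long heapBytes) {
        return (heapBytes / 512) * 8;
    }

    public static void main(String[] args) {
        long heap = 5L << 30; // 5 GB, as in the NMT experiment above
        System.out.printf("bitmap: %d MB (%.2f%%)%n",
                bitmapBytes(heap) >> 20, 100.0 * bitmapBytes(heap) / heap);
        System.out.printf("BOT:    %d MB (%.2f%%)%n",
                botBytes(heap) >> 20, 100.0 * botBytes(heap) / heap);
    }
}
```

For a 5 GB heap that works out to roughly 80 MB (~1.56%) per structure, which lines up with the ~160 MB gap between the baseline and peak NMT numbers below.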
>> My first thought when reading this was: if we add this overhead to
>> Serial, is it still worth keeping around? Could we have a
>> single-threaded mode of G1 with similar overhead? I did some quick
>> runs using a small test and a 5G heap, and got these overhead numbers
>> using NMT:
>> Baseline Serial: ~14MB
>> New Serial: ~14MB (Peak: ~174MB)
>> G1: ~167MB
>> G1 + ParallelGCThreads=1: ~138MB
>
> Those numbers look like what I would expect, yes. Can you share the program? Or try the other extreme and i-hash all the objects that you allocate?
>
This was just using one of the below benchmarks :) So nothing that would
blow up the preserved marks. Could be a good benchmark to add. Done :)
But Serial seems to be able to store the preserved marks on the heap for
System GCs, so it doesn't really show anything. It still makes sense to
add, but I had to let half the objects die, due to the optimization you
added a few weeks back for G1. Otherwise G1 didn't do any preserving
either.
I did create a small standalone test that just allocates and hashes, and
that really blows up the old Serial as expected. Again, G1 looks ok now
that we only preserve marks for moving objects. Here's the test:
public class HashStress {
    public static void main(String[] args) throws Exception {
        int length = 25_000_000;
        Object[] array = new Object[length];
        for (int i = 0; i < 4 * length; i++) {
            Object obj = new Object();
            obj.hashCode();
            array[i % length] = obj;
        }
    }
}
Running it with a 1g heap gives these NMT GC peak values:
Baseline Serial: ~386MB
New Serial: ~55MB
G1 (only 1t): ~97MB
I think numbers like these are good to have. They show a clear benefit
of the new predictable overhead, even though this is probably a pretty
rare use case.
>> This is of course just one very simple data point and G1 would likely
>> consume additional memory in a real world use case (and would also need
>> to be rewritten using a similar technique). So I'm not proposing to drop
>> Serial instead of doing this new full GC implementation, but I think
>> it's important to think along those lines. To me a big reason for having
>> the Serial collector is use cases where you don't care about GC
>> performance but want low overhead.
>
> Correct. And/or only have 1 or very few cores (e.g. small containers). Consider that we’re doing this to support compact object headers, which typically save more memory than the extra overhead taken by this Serial Full GC implementation.
>
Agreed. I don't argue this.
> Also for context: going to 8-byte headers does not strictly require that new full-GC, as long as we can squeeze the forwarding pointers into the lower 32 bits of the header. I provided one way to do that here: https://github.com/openjdk/jdk/pull/13582 . Another alternative would be to simply use compressed oops, as long as the heap is small enough. However, if/when we want to go to 4-byte headers (which I certainly want to do soon-ish), then we don't have space for any forwarding pointer in the header and need an alternative. I studied what Parallel GC does and read around in the literature, and what I proposed here looked to me like the best compromise.
>
I also think I prefer this approach to the sliding forwarding,
regardless of the size of the headers. It still depends a little on how
the implementation for G1 will look; I'm not sure if keeping regions
will make things more complicated or not.
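For readers following along, the "squeeze the forwarding pointer into the lower 32 bits" idea can be illustrated with a minimal sketch (purely illustrative on my part, not the actual scheme from the PR): store the destination as a word offset from the heap base, which with 8-byte alignment covers up to 32 GB of heap.

```java
// Hypothetical illustration of a 32-bit forwarding encoding; HEAP_BASE
// and the exact encoding are assumptions, not the PR's real implementation.
public class CompressedForwarding {
    static final long HEAP_BASE = 0x1000_0000L; // assumed heap start (illustrative)

    // Encode an 8-byte-aligned destination address as a 32-bit word offset.
    static int encode(long destAddr) {
        return (int) ((destAddr - HEAP_BASE) >>> 3);
    }

    // Decode the 32-bit word offset back into an absolute address.
    static long decode(int encoded) {
        return HEAP_BASE + ((encoded & 0xFFFF_FFFFL) << 3);
    }
}
```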
> I’ve also got some ideas how to reduce memory usage of the algorithm:
> - It looks to me like when full-GC is not triggered by promotion failure, it does not collect (young) from-space. If from-space has enough room, we could try to place the BOT and/or the marking bitmap there. I don’t know how common that is, I haven’t even dug into it enough to know under which scenario this happens and if from-space would typically have space left.
This is what's currently used by Serial to save space for the preserved
marks. But only the to_space, right? from_space will have objects. It
looks like just trying DefNewGeneration::contribute_scratch(...) and
checking whether the result is big enough should be pretty
straightforward.
> - Marking bitmap and BOT could be allocated/mmap-ed only for the relevant (e.g. actually used/collected memory of the heap). Not sure how much that buys, but in the case above where from-space is not collected, we don’t need to allocate a BOT for it.
> - The BOT could be made coarser. I only chose 64-word blocks because 1. Literature said it’s a reasonable number and 2. Counting the bits in a block causes at most a single load from the marking bitmap. Increasing block-size to 128 would cut the BOT in half.
> - I think the marking bitmap could only be made smaller if object alignment is increased, but I’ll need to think about this a little.
Sounds like good things to investigate.
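As a rough sketch of why the 64-word block size is attractive (hypothetical code on my part, not the prototype's): with one mark bit per heap word, a 64-word block maps to exactly one 64-bit bitmap word, so counting the live words that precede an object within its block needs a single bitmap load plus a popcount, with the BOT supplying the running total for all earlier blocks.

```java
// Hypothetical sketch of the bitmap + BOT live-word calculation discussed
// above. bitmap[b] is the 64-bit mark word for block b (one bit per heap
// word); bot[b] holds the number of live words before block b.
public class ForwardingSketch {
    static final int BLOCK_WORDS = 64; // doubling this to 128 would halve the BOT

    // Number of live words preceding wordIndex; one bitmap load per query.
    static long liveWordsBefore(long[] bitmap, long[] bot, int wordIndex) {
        int block = wordIndex / BLOCK_WORDS;
        int bitInBlock = wordIndex % BLOCK_WORDS;
        // Mask keeps only the bits strictly before wordIndex within the block.
        long mask = bitInBlock == 0 ? 0 : -1L >>> (64 - bitInBlock);
        return bot[block] + Long.bitCount(bitmap[block] & mask);
    }
}
```

The compaction destination then follows as the compaction base plus that live-word count, which is why a coarser BOT only trades a little extra bit-counting for half the table.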
>
>>> - Do you know of any targeted benchmarks that specifically measure full-GC performance? So far I only used this benchmark, which exaggerates the performance impact of the full-GC code, and measured the average time reported for the full-GC phases:
>>
>> I actually have a few different small SystemGC benchmarks that I used
>> way back when I did the parallel Full GC for G1. The plan was to
>> contribute them at some point, but that hasn't happened yet :) I dug
>> them out and updated them a bit to make sure they run as intended out
>> of the box. They are very similar to the test you have used, but cover
>> a few more scenarios. Not at all testing everything, but something to
>> give some peace of mind when it comes to performance. The branch is here:
>> https://github.com/openjdk/jdk/compare/master...kstefanj:jdk:system-gc-benchmarks
>>
>> They are based on JMH and use the single-shot mode. Run them either
>> through the make files:
>> make test TEST=micro:org.openjdk.bench.vm.gc.SystemGC*
>> Or when you have the benchmarks jar:
>> java -jar benchmarks.jar SystemGC
>>
>> The tests generate about 1 GB of objects, so to ensure that no GCs
>> occur before the actual System.gc() I usually run the benchmarks with:
>> -Xmx5g -Xms5g -Xmn3g
>>
>> Hope this helps. Though, I don't think performance is what we will
>> discuss the most around this change.
>
> Thanks, I will try them. This looks quite useful!
Sounds good, I will make sure to get them contributed for real this time :)
Thanks,
StefanJ
>
> Thanks for your feedback!
>
> Cheers,
> Roman
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>