RFC: New Serial Full GC

Roman Kennke rkennke at gmail.com
Fri Feb 9 18:27:03 UTC 2024


Hi Thomas,

Thank you for the thorough analysis!

I took another shot at rewriting the table-building. This new approach is strictly block-centric, i.e. it scans the spaces block-by-block (and only those blocks that actually have objects starting in them!). It requires some scanning around block boundaries to find the first object that starts in a block, but that seems ok. That phase is now down to just a few ms. The code is also much cleaner and easier to understand.

Performance is still not *quite* there, but almost, and I think I can squeeze some more out of it. See attached graph:

[Attachment: PastedGraphic-1.png (image/png, 27613 bytes) — https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20240209/8849b16a/PastedGraphic-1-0001.png]


I pushed that new code to the PR. Let me know what you think.

https://github.com/openjdk/jdk/pull/17312

Have a good weekend!

Roman


> On Feb 9, 2024, at 2:21 PM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> 
> Hi Roman,
> 
> On 25.01.24 12:04, Kennke, Roman wrote:
> > Hello GC engineers,
> >
> > (This is my 3rd attempt to send this email, this time from my personal account. Apologies if the message is delivered more than once, in case the first two make it through, eventually.)
> >
> >[...]
> >
> > Therefore I've been looking around in the literature and found an approach that the GC handbook calls "compressor", so I used that name, too :-)
> >
> 
> Interesting to try out :)
> 
> > The basic idea is to compute the forwarding location for each object on the fly by using a marking bitmap and a (sort-of) block-offset-table. For full description of the algorithm, see the file serialCompressor.hpp in draft PR:
> >
> > https://github.com/openjdk/jdk/pull/17312
> >
> > (For those who know, this is pretty much the algorithm that is already used by the Parallel GC, modulo the stuff to make it work with multiple worker threads.)
> >
> > The prototype is fairly complete and almost ready for review, but I wanted to get opinions from the experts, first, before I spend any more time on it.
> >
> > In particular:
> > - Do you think it is feasible at all to do this?
> 
> The algorithm and implementation look interesting: the algorithm seems easier to understand, and the implementation much cleaner than the existing one.
> 
> > Performance seems to be on-par with (or even slightly better than) the current mark-compact algorithm.
> 
> Unfortunately my testing does not support that. The algorithm is competitive only when the space to be compacted is either almost empty or almost full (like in that example). The implementation (after fixing it, see below) does extremely badly if the heap to compact is half-full (>3x slower).
> 
> I attached a new "Retain.java" to which one can pass a "step" value, i.e. a percentage of objects to drop.
> 
> (Running with java -Dstep=$i -XX:MarkSweepAlwaysCompactCount=1 -Xmx4g -Xms4g -XX:+UseSerialGC -XX:+UnlockExperimentalVMOptions -XX:+UseCompressorFullGC -Xlog:gc,gc+phases Retain)
> 
> I understand that this is a fairly simplistic case where objects are evenly sized and small, and the fragmentation very regular.
> 
> Further, the forwarding table building is buggy :)
> 
> It is possible with some particular setup of objects, that the while loop at https://github.com/openjdk/jdk/pull/17312/files#diff-f51ebb40a99f04f6945309f639e3c03d953daee74ef38c13d85c33d21d7fd0f3R227 only terminates at the end of the space (for every live object).
> 
> Something like:
> 
>      while (addr_to_block_idx(chunk_end - 1) == addr_to_block_idx(next_live) && addr_to_block_idx(chunk_end - 1) == addr_to_block_idx(next_dead) && next_dead < top) {
> 
> seems to work; the "chunk_end - 1" is to bail out immediately on chunks ending at a block boundary (i.e. there can be no straggler), and the other change is that we need to check both next_live/next_dead against the current chunk end.
> 
> Having analyzed the phases of the new algorithm a bit, following observations:
> 
> - the new marking phase can be significantly slower (50-100%) than the old one with any significant amount of live objects. This is both because of using an external bitmap (some very old measurements of mine suggest that just making the liveness information external incurs ~10% overhead, everything else being the same), and likely because of the additional mark work.
> I did not look at it in detail; there may be inlining issues etc.
> 
> - the build table phase is very slow (and very wasteful in what it does).
> 
>  * Probably a block centric algorithm that does not redo all the work for all objects in a single block should improve the situation a lot because....
> 
>  * it would help understanding of the code to use more helper methods (or more comments) in the main loop.
> 
>  * the code to determine the liveness estimate, i.e. the while loop, could easily be improved to avoid any bitmap searching, because a) we know that if the bitmap contains a mark at the next block boundary, there is a chunk/object overlapping it, and b) if there is an overlapping chunk, we can easily determine the length of the straggling object by querying it (using the real BOT, at least for the old gen).
> 
>  * the block marking part re-calculates the forwarding value for every object as well, which is unnecessary; we only need the forwarding address of the first live word in the block, and that is the same for all live objects in the same block.
> 
> Fwiw, the live-estimate part of table building is by far the most expensive.
> 
> E.g. random measurements for step=50 with the Retain test (values in ms):
> 
> [21,346s][info][gc       ] GC(21) finding next live object: 114,25 estimate liveness: 503,50 update block table: 98,36
> 
> Something like (still doing some bitmap searching):
> 
>      HeapWord* next_block_addr = align_up(chunk_end, bytes_per_block());
>      size_t live_in_block;
>      if (_mark_bitmap.is_marked(next_block_addr)) {
>        // maybe use BOT for old gen....
>        live_in_block = pointer_delta(_mark_bitmap.get_next_unmarked_addr(next_block_addr, top), current);
>      } else {
>        live_in_block = pointer_delta(next_block_addr, current);
>      }
> 
> (the pre-check whether the bit is set is actually unnecessary, but it is ~5% faster than relying on just get_next_unmarked_addr())
> 
> seems to make that part 5x faster for worst case (it's a bit slower when everything is live).
> 
> See also the attached graph.
> 
> - the compaction phase seems to perform about the same as phases 3+4 of the old algorithm, but I did not check this in detail at all.
> 
> - I recommend calling this new table "[block] forwarding table" and not block offset table because that term is already taken, and it does not contain offsets anyway.
> 
> There may be more (a lot of?) low hanging fruits in all parts of the implementation.
> 
> > - Any other concerns that might block this effort or make it unfeasible?
> 
> Currently, performance seems just too bad :)
> 
> Other than that, the code (esp. the table building) would need some care imo, i.e. using some helper methods and/or documentation of why something is done that way.
> 
> Particularly for the liveness-estimate part of table building it is imo non-obvious why things are done this way (actually, initially I thought: what is the purpose of this code, given that it's jammed in between?), and whether the error in the estimate (it is possible to compute this exactly with the original loop) is intentional, etc.
> 
> Note that I do not have the current GC handbook on hand, and this adaptation to multiple spaces does not seem to be in the papers either. (It would be nice to cite them with proper references, too.)
> 
> However I guess some performance work needs to be done here still.
> 
> I will probably play around with the implementation a bit more, trying out Stefan's full gc jmh, and looking at other platforms than just x86. It would also be interesting to see the tradeoff between different sizes of the forwarding table (i.e. every entry covering more than 512 bytes).
> 
> Hth,
>  Thomas
> [Attachments: 20240209-retain-benchmark-serial-system-gc.png, Retain.java]


