RFC: New Serial Full GC

Thomas Schatzl thomas.schatzl at oracle.com
Fri Feb 9 13:21:56 UTC 2024


Hi Roman,

On 25.01.24 12:04, Kennke, Roman wrote:
 > Hello GC engineers,
 >
 > (This is my 3rd attempt to send this email, this time from my 
personal account. Apologies if the message is delivered more than once, 
in case the first two make it through, eventually.)
 >
 >[...]
 >
 > Therefore I’ve been looking around in the literature and found an 
approach that the GC handbook calls ‘compressor’ so I used that name, 
too :-)
 >

Interesting to try out :)

 > The basic idea is to compute the forwarding location for each object 
on the fly by using a marking bitmap and a (sort-of) block-offset-table. 
For full description of the algorithm, see the file serialCompressor.hpp 
in draft PR:
 >
 > https://github.com/openjdk/jdk/pull/17312
 >
 > (For those who know, this is pretty much the algorithm that is 
already used by the Parallel GC, modulo the stuff to make it work with 
multiple worker threads.)
 >
 > The prototype is fairly complete and almost ready for review, but I 
wanted to get opinions from the experts, first, before I spend any more 
time on it.
 >
 > In particular:
 > - Do you think it is feasible at all to do this?

The algorithm and implementation look interesting: the algorithm seems 
easier to understand and the implementation much cleaner than the 
existing one.
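
For readers who don't want to dig through the PR: a toy model of the 
bitmap-plus-table forwarding computation (all names, types and the block 
size below are made up for illustration; the actual HotSpot code 
differs) could look like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy Compressor-style forwarding: a mark bit per heap word plus one table
// entry per block, holding the compacted destination of the block's first
// live word. The forwardee of any live word is its block's entry plus the
// number of live words preceding it inside the block. All names and the
// block size are illustrative only.
struct ToyForwardingTable {
  static const size_t kWordsPerBlock = 64;   // e.g. 512 bytes of 8-byte words
  std::vector<bool>   mark;                  // external marking bitmap
  std::vector<size_t> block_dest;            // per-block forwarding entries

  explicit ToyForwardingTable(size_t heap_words)
    : mark(heap_words, false),
      block_dest(heap_words / kWordsPerBlock + 1, 0) {}

  // One linear pass: a running live-word count yields each block's entry.
  void build() {
    size_t compacted = 0;
    for (size_t b = 0; b < block_dest.size(); b++) {
      block_dest[b] = compacted;
      size_t start = b * kWordsPerBlock;
      size_t end   = std::min(start + kWordsPerBlock, mark.size());
      for (size_t w = start; w < end; w++) {
        if (mark[w]) compacted++;
      }
    }
  }

  // Destination (as a compacted word index) of a live word.
  size_t forwardee(size_t word) const {
    size_t result = block_dest[word / kWordsPerBlock];
    for (size_t w = (word / kWordsPerBlock) * kWordsPerBlock; w < word; w++) {
      if (mark[w]) result++;
    }
    return result;
  }
};
```

Note how in this model the per-block entry only ever needs the 
forwarding address of the block's first live word; everything else 
falls out of the bitmap.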

 > Performance seems to be on-par with (or even slightly better than) 
the current mark-compact algorithm.

Unfortunately my testing does not support that. The algorithm is 
competitive only when the space to be compacted is either almost empty 
or almost full (like in that example). The implementation (after fixing 
it, see below) does extremely badly if the heap to compact is half-full 
(>3x slower).

I attached a new "Retain.java" to which one can pass a "step" value, 
i.e. a percentage of objects to drop.

(Running with java -Dstep=$i -XX:MarkSweepAlwaysCompactCount=1 -Xmx4g 
-Xms4g -XX:+UseSerialGC -XX:+UnlockExperimentalVMOptions 
-XX:+UseCompressorFullGC -Xlog:gc,gc+phases Retain)

I understand that this is a fairly simplistic case where objects are 
evenly sized and small, and the fragmentation very regular.

Further, the forwarding table building is buggy :)

It is possible, with some particular setup of objects, that the while 
loop at 
https://github.com/openjdk/jdk/pull/17312/files#diff-f51ebb40a99f04f6945309f639e3c03d953daee74ef38c13d85c33d21d7fd0f3R227 
only terminates at the end of the space (for every live object).

Something like:

       while (addr_to_block_idx(chunk_end - 1) == addr_to_block_idx(next_live) &&
              addr_to_block_idx(chunk_end - 1) == addr_to_block_idx(next_dead) &&
              next_dead < top) {

seems to work; the "chunk_end - 1" is to bail out immediately on chunks 
ending at a block boundary (i.e. there can be no straggler), and the 
other change is that we need to check both next_live/next_dead against 
the current chunk end.

Having analyzed the phases of the new algorithm a bit, here are my 
observations:

- the new marking phase can be significantly slower (50-100%) than the 
old one with any significant amount of live objects. This is both 
because of using an external bitmap (some very old measurements of mine 
suggest that just making the liveness information external incurs ~10% 
overhead, everything else being equal) and likely because of the 
additional mark work.
I did not look at it in detail; there may be inlining issues etc.

- the build table phase is very slow (and very wasteful in what it does).

   * Probably a block-centric algorithm that does not redo all the work 
for all objects in a single block should improve the situation a lot, 
because....

   * it would help understanding of the code to use more helper methods 
(or more comments) in the main loop.

   * the code to determine the liveness estimate, i.e. the while loop, 
could easily be improved to not do any bitmap searching, because a) we 
know that if the bitmap contains a mark at the next block boundary 
there is a chunk/object overlapping it, and b) if there is an 
overlapping chunk we can easily determine the length of the straggling 
object by querying it (using the real BOT, at least for the old gen).

   * the block marking part re-calculates the forwarding value for every 
object as well, which is unnecessary; we only need the forwarding 
address of the first live word in the block, and that is the same for 
all live objects in the same block.

Fwiw, the liveness-estimate part of table building is by far the most 
expensive.

E.g. random measurements for step=50 with the Retain test (values in ms):

[21,346s][info][gc       ] GC(21) finding next live object: 114,25 
estimate liveness: 503,50 update block table: 98,36

Something like (still doing some bitmap searching):

       HeapWord* next_block_addr = align_up(chunk_end, bytes_per_block());
       size_t live_in_block;
       if (_mark_bitmap.is_marked(next_block_addr)) {
         // maybe use BOT for old gen....
         live_in_block = pointer_delta(_mark_bitmap.get_next_unmarked_addr(next_block_addr, top),
                                       current);
       } else {
         live_in_block = pointer_delta(next_block_addr, current);
       }

(the pre-check whether the bit is set is actually unnecessary, but it is 
~5% faster than relying on just get_next_unmarked_addr())

seems to make that part 5x faster in the worst case (it's a bit slower 
when everything is live).

See also the attached graph.

- the compaction phase seems to have roughly the same performance as 
phases 3+4 of the old algorithm, but I did not check in detail at all.

- I recommend calling this new table "[block] forwarding table" and not 
"block offset table", because that term is already taken and the table 
does not contain offsets anyway.

There may be more (a lot of?) low-hanging fruit in all parts of the 
implementation.

 > - Any other concerns that might block this effort or make it unfeasible?

Currently, performance seems just too bad :)

Other than that, the code (esp. the table building) would need some care 
imo, i.e. some helper methods and/or documentation on why something is 
done a particular way.

Particularly for the liveness-estimate part of table building it is imo 
non-obvious why things are done this way (initially I actually thought: 
wtf is the purpose of this code, given it's jammed in between?), and 
whether the error in the estimate (it is possible to compute this 
exactly with the original loop) is intentional, etc.

Note that I do not have the current GC handbook on hand, and this 
adaptation to multiple spaces does not seem to be in the papers either. 
(It would be nice to cite them with proper references, too.)

However I guess some performance work needs to be done here still.

I will probably play around with the implementation a bit more, trying 
out Stefan's full gc jmh, and looking at other platforms than just x86. 
It would also be interesting to see the tradeoff between different sizes 
of the forwarding table (i.e. every entry covering more than 512 bytes).
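
A quick back-of-the-envelope for that tradeoff (assuming, purely for 
illustration, 8-byte table entries and a 4 GiB heap; neither number is 
taken from the PR):

```cpp
#include <cstdint>

// Table footprint as a function of block size: doubling the bytes covered
// per entry halves the table, at the price of a longer expected intra-block
// bitmap scan per lookup. The 8-byte entry size is an assumption.
uint64_t table_bytes(uint64_t heap_bytes, uint64_t block_bytes) {
  const uint64_t entry_bytes = 8;
  return heap_bytes / block_bytes * entry_bytes;
}
```

E.g. under those assumptions a 4 GiB heap needs a 64 MiB table with 
512-byte blocks, and 32 MiB with 1024-byte blocks.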

Hth,
   Thomas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20240209-retain-benchmark-serial-system-gc.png
Type: image/png
Size: 43188 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20240209/79ed25cf/20240209-retain-benchmark-serial-system-gc-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Retain.java
Type: text/x-java
Size: 1168 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20240209/79ed25cf/Retain-0001.java>

