RFR: 8186571: Implementation: JEP 307: Parallel Full GC for G1

Mon Sep 4 15:36:58 UTC 2017

Hi,

Please review the implementation of JEP-307:
https://bugs.openjdk.java.net/browse/JDK-8172890

Webrev:
http://cr.openjdk.java.net/~sjohanss/8186571/hotspot.00/

Summary:
As communicated late last year [1], I've been working on parallelizing 
the Full GC for G1. The implementation is now ready for review.

The approach I chose was to redo marking at the start of the Full GC and 
not reuse the marking information from the concurrent mark cycle. The 
main reason behind this is to maximize the chance of freeing up memory. 
I did reuse the marking bitmap from the concurrent mark code though, so 
liveness is recorded in a bitmap instead of in the mark word. The mark 
word is still used for forwarding pointers, so the original mark words 
still have to be preserved for some objects.
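
To illustrate the bitmap marking, here is a minimal Java model of a side 
bitmap that several marking threads can set bits in concurrently. It is 
only a sketch and not the HotSpot code (which is C++); the class and 
method names are hypothetical, and it assumes one bit per object slot.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical side bitmap: liveness lives here, not in the mark word. */
final class MarkBitmapSketch {
    private final AtomicLongArray words;

    MarkBitmapSketch(int bits) {
        words = new AtomicLongArray((bits + 63) >>> 6);
    }

    /** Atomically sets the bit; returns true only for the thread that set it. */
    boolean parMark(int index) {
        int w = index >>> 6;
        long mask = 1L << (index & 63);
        while (true) {
            long old = words.get(w);
            if ((old & mask) != 0) {
                return false; // already marked by some thread
            }
            if (words.compareAndSet(w, old, old | mask)) {
                return true;  // this thread won the race
            }
        }
    }

    boolean isMarked(int index) {
        return (words.get(index >>> 6) & (1L << (index & 63))) != 0;
    }
}
```

Only the thread for which parMark returns true would go on to scan the 
object, so each live object is visited once.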

The algorithm is still a four-phase mark-compact, but each phase is 
handled by parallel workers. Marking and reference processing are done 
in phase 1. In phase 2 all worker threads work through the heap claiming 
regions which they prepare for compaction. This is done by installing 
forwarding pointers into the mark word of the live objects that will 
move. The regions claimed by a worker in this phase will be the same 
regions that the worker will compact in phase 4. This ensures that 
objects are not overwritten before they have been moved.
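
The claiming and forwarding in phase 2 can be modelled roughly as 
follows. This is a Java sketch under simplifying assumptions, not the 
code in the webrev: RegionClaimer and PrepareWorker are hypothetical 
names, a region is described only by the sizes of its live objects in 
address order, and a forwarding pointer is a {region, offset} pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hands out each region index exactly once, even with many workers. */
final class RegionClaimer {
    private final AtomicInteger next = new AtomicInteger();
    private final int numRegions;

    RegionClaimer(int numRegions) { this.numRegions = numRegions; }

    /** Returns the next unclaimed region index, or -1 when none are left. */
    int claim() {
        int r = next.getAndIncrement();
        return r < numRegions ? r : -1;
    }
}

/** Per-worker state: the regions it claimed and its compaction point. */
final class PrepareWorker {
    final List<Integer> claimed = new ArrayList<>();
    private final int regionSize;
    private int destIdx = 0; // index into 'claimed' of the destination region
    private int destTop = 0; // words already assigned in that region

    PrepareWorker(int regionSize) { this.regionSize = regionSize; }

    /**
     * Assigns a forwarding location {region, offset} to every live object
     * of a newly claimed region. liveSizes is assumed to sum to at most
     * regionSize.
     */
    List<int[]> prepare(int region, int[] liveSizes) {
        claimed.add(region);
        List<int[]> forwards = new ArrayList<>();
        for (int size : liveSizes) {
            if (destTop + size > regionSize) { // no room: use next claimed region
                destIdx++;
                destTop = 0;
            }
            forwards.add(new int[] { claimed.get(destIdx), destTop });
            destTop += size;
        }
        return forwards;
    }
}
```

Because a worker only ever forwards objects into regions from its own 
claimed list, no other worker can have live data at the destination.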

In phase 3, all pointers to other objects are updated by looking at the 
forwarding pointers. At this point all information needed to create new 
remembered sets is available and this rebuilding has been added to phase 
3. In the old version, remembered set rebuilding was done in a separate 
pass after the compaction; doing it in phase 3 is more efficient.
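
A small Java model of the combined pass may help when reading the phase 
3 code. It is a sketch only: addresses are plain ints, a region is 
address / regionSize, and the remembered set records the referring 
object rather than a card, which is a simplification. All names are 
hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class AdjustSketch {
    /**
     * fields:     old address of an object -> old addresses it references
     * forwarding: old address -> new address (absent if the object stays)
     * Rewrites the reference arrays in place and returns, per target
     * region, the new addresses of objects in other regions pointing
     * into it.
     */
    static Map<Integer, Set<Integer>> adjustAndRebuild(
            Map<Integer, int[]> fields,
            Map<Integer, Integer> forwarding,
            int regionSize) {
        Map<Integer, Set<Integer>> remsets = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : fields.entrySet()) {
            int holderNew = forwarding.getOrDefault(e.getKey(), e.getKey());
            int[] refs = e.getValue();
            for (int i = 0; i < refs.length; i++) {
                // Adjust the pointer to the referent's new location.
                refs[i] = forwarding.getOrDefault(refs[i], refs[i]);
                int targetRegion = refs[i] / regionSize;
                // Cross-region reference: record it in the same pass.
                if (targetRegion != holderNew / regionSize) {
                    remsets.computeIfAbsent(targetRegion, k -> new HashSet<>())
                           .add(holderNew);
                }
            }
        }
        return remsets;
    }
}
```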

As mentioned, phase 4 is when the compaction is done. In this first 
version, to avoid some complexity, there is no work stealing in this 
phase. This will lead to some imbalance between the workers, but this 
can be treated as a separate RFE in the future.
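
A rough picture of the per-worker compaction loop, written as a Java 
model rather than the C++ in the webrev. The heap is modelled as an int 
array and the {src, dst, size} triples stand in for the forwarding 
pointers installed in phase 2; all names are made up for illustration.

```java
final class CompactSketch {
    /**
     * moves: {src, dst, size} triples in ascending src order. Sliding
     * compaction only moves objects toward lower addresses, so walking
     * the list in this order never overwrites an object that has not
     * been moved yet.
     */
    static void compact(int[] heap, int[][] moves) {
        for (int[] m : moves) {
            if (m[1] > m[0]) {
                throw new IllegalArgumentException("dst must not exceed src");
            }
            // arraycopy handles overlapping source and destination ranges.
            System.arraycopy(heap, m[0], heap, m[1], m[2]);
        }
    }
}
```

Each worker would run such a loop over its own claimed regions only, 
which is why a worker with little live data finishes early and cannot 
help the others.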

The part of this work that has generated the most questions during 
internal discussions is the serial parts of phases 2 and 4. They are 
executed only if the parallel workers do not free up any regions, and 
act as a safety mechanism to avoid throwing a premature OOM. In that 
case a single-threaded pass is done over the last region of each worker 
(so at most number-of-workers regions are handled) to further compact 
these regions and hopefully free some of them up.
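
The effect of the serial pass can be estimated with a small Java model. 
This is a sketch under assumptions and not the actual implementation: 
each worker's last region is given as the sizes of the live objects left 
in it, objects are never split, and the method name is hypothetical.

```java
final class SerialFallbackSketch {
    /**
     * tails[i] holds the live object sizes in worker i's last region.
     * Packs them front to back with a single compaction point and
     * returns how many of those regions end up completely empty.
     */
    static int regionsFreed(int[][] tails, int regionSize) {
        int dest = 0;    // index of the region currently being filled
        int destTop = 0; // words used in that region
        for (int[] tail : tails) {
            for (int size : tail) {
                if (destTop + size > regionSize) { // no room: next region
                    dest++;
                    destTop = 0;
                }
                destTop += size;
            }
        }
        int used = destTop > 0 ? dest + 1 : dest;
        return tails.length - used;
    }
}
```

For example, with a region size of 10 and tails of sizes {3}, {4} and 
{2, 2}, the first three objects fit in one region and the last spills 
into a second one, so one of the three regions is freed.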

Testing:
* A lot of local sanity testing, both functional and performance.
* Passed tier 1-5 of internal testing on supported platforms.
* No regressions in performance testing.

Cheers,
Stefan

[1] 
http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2016-November/019216.html