RFC: Throughput barriers for G1

Wed Nov 9 15:11:18 UTC 2016

Hi all,

One of the concerns with using G1 has been the throughput reduction
caused by its costly (post-)barriers and refinement.

This RFC proposes using the same write post-barrier for G1 as for the
other collectors, and disabling concurrent refinement. This improves
throughput at the cost of predictability, as the refinement work then
needs to be performed in a GC pause, like in the other collectors.

Background:
The G1 write-barrier consists of two parts, the pre-write barrier and 
the post-write barrier. For a mental model, the barrier (vastly 
simplified) looks like the following in "pseudo-C++":

// Barrier for a write like: o.x = y;

// The pre-write barrier due to SATB
if (conc_mark_is_active) {
   if (o.x != NULL) {
     add_to_satb_queue(o.x);  // SATB records the old value of the field
   }
}

// The actual write
o.x = y;

// The post-write barrier to keep track of pointers between regions
if (region(o) != region(y)) {
   if (y != NULL) {
     if (card(o.x) != Young) {
       StoreLoad();
       if (card(o.x) != Dirty) {
         card(o.x) = Dirty;
         add_to_refinement_queue(card(o.x));
       }
     }
   }
}

As far as we know (based on performance runs) the pre-write part of the 
barrier is rarely a throughput problem (but we would of course 
appreciate it if others confirmed this).  The post-write barrier causes
two problems for throughput:
- the sheer size of the post-write barrier (the number of assembly
   instructions)
- the logic of the post-write barrier (the branches and also adding
   a card to the refinement queue)

The responsibility of the post-write barrier is to queue up pointers 
between regions so that the concurrent refinement threads can update
the remembered sets concurrently.  If you were to give this up, an
alternative post-write barrier could look like:

if (y != NULL) {
   if (card(o.x) == Clean) {
     card(o.x) = Dirty;
   }
}

The above post-write barrier will result in better throughput because 
the barrier consists of fewer instructions, fewer branches and (in
particular) no enqueuing.  The concurrent refinement threads will also 
be turned off with this kind of post-write barrier, which will further 
increase throughput.
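
To make the simplified barrier concrete, here is a minimal, runnable
sketch of it against a toy card table. This is not HotSpot code: the
names ToyCardTable, card_for and dirty_count are hypothetical, addresses
are plain integers, and the 512-byte card size is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only; none of these names exist in HotSpot.
enum class Card : std::uint8_t { Clean, Dirty, Young };

constexpr std::size_t kCardShift = 9;  // assumption: 512-byte cards

struct ToyCardTable {
    std::uintptr_t heap_base;
    std::vector<Card> cards;  // one entry per card, all Clean initially

    ToyCardTable(std::uintptr_t base, std::size_t heap_bytes)
        : heap_base(base),
          cards((heap_bytes + (std::size_t{1} << kCardShift) - 1) >> kCardShift,
                Card::Clean) {}

    // Card covering the field at field_addr (card(o.x) in the pseudo-code).
    Card& card_for(std::uintptr_t field_addr) {
        return cards[(field_addr - heap_base) >> kCardShift];
    }

    // Simplified post-write barrier for "o.x = y": no region check,
    // no StoreLoad and no enqueuing. new_val == 0 models y == NULL.
    void post_write_barrier(std::uintptr_t field_addr, std::uintptr_t new_val) {
        if (new_val != 0) {
            Card& c = card_for(field_addr);
            if (c == Card::Clean) {
                c = Card::Dirty;
            }
        }
    }

    // Number of cards the GC pause would have to refine.
    std::size_t dirty_count() const {
        return static_cast<std::size_t>(
            std::count(cards.begin(), cards.end(), Card::Dirty));
    }
};
```

Repeated writes to fields on the same card leave a single dirty card
behind, and NULL stores or stores to Young cards leave the table
untouched; all remaining work is deferred to whoever scans the table.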

However, this is a trade-off. The cards will now have to be refined 
during a STW collection pause, which will increase the time of the 
pause. For certain kinds of applications, this trade-off might be worth
it, especially if the heap size isn't too big (the size of the card 
table scales with the heap size). G1 would still be able to 
incrementally compact the heap in order to avoid Full GCs.
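
As a rough illustration of how the card table scales with the heap,
the sketch below computes the table size under the assumption of
512-byte cards and one byte of card table per card. Both numbers are
assumptions made for this example, and card_table_bytes is a
hypothetical helper.

```cpp
#include <cstdint>

// Assumption for this example: each card covers 512 bytes of heap and
// occupies one byte in the card table.
constexpr std::uint64_t kAssumedCardBytes = 512;

// Card table size in bytes for a heap of heap_bytes, rounded up.
constexpr std::uint64_t card_table_bytes(std::uint64_t heap_bytes) {
    return (heap_bytes + kAssumedCardBytes - 1) / kAssumedCardBytes;
}

// 4 GiB heap -> 8 MiB of card table; 32 GiB heap -> 64 MiB.
static_assert(card_table_bytes(4ULL << 30) == (8ULL << 20), "4 GiB heap");
static_assert(card_table_bytes(32ULL << 30) == (64ULL << 20), "32 GiB heap");
```

Under these assumptions the amount of card table a pause may need to
look at grows linearly with the heap, which is why smaller heaps are
the better fit for this trade-off.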

One may still add the cross-region check in the barrier to decrease the 
number of cards to process in the GC pause.
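
One cheap way to write such a cross-region filter is sketched below.
It assumes regions are power-of-two sized and aligned to their size,
so that two addresses lie in the same region exactly when they agree
in all bits above the region-size bits. The function names and the
log_region_bytes parameter are hypothetical.

```cpp
#include <cstdint>

// True if both addresses fall into the same region. Assumes regions of
// (1 << log_region_bytes) bytes, aligned to their own size.
inline bool same_region(std::uintptr_t field_addr, std::uintptr_t new_val,
                        unsigned log_region_bytes) {
    return ((field_addr ^ new_val) >> log_region_bytes) == 0;
}

// Filter for the simplified barrier: only stores of non-NULL references
// that cross a region boundary need to dirty a card.
inline bool needs_card_mark(std::uintptr_t field_addr, std::uintptr_t new_val,
                            unsigned log_region_bytes) {
    return new_val != 0 && !same_region(field_addr, new_val, log_region_bytes);
}
```

The filter costs an XOR, a shift and a branch per reference store, so
it gives back part of the throughput gain in exchange for fewer dirty
cards to refine in the pause.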

One possible optimization of the refinement in the GC pause is to
delay, until after the pause, the refinement of cards that do not
contain references into the collection set.
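
A toy sketch of that split follows. It assumes each dirty card has
already been scanned into a list of target region ids, which is a
simplification; ToyDirtyCard, RefinementPlan and plan_refinement are
hypothetical names, not part of G1.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy model: a dirty card and the regions its references point into.
struct ToyDirtyCard {
    std::size_t index;
    std::vector<int> target_regions;
};

struct RefinementPlan {
    std::vector<ToyDirtyCard> in_pause;  // must be handled during the pause
    std::vector<ToyDirtyCard> deferred;  // can be refined after the pause
};

// Cards with at least one reference into the collection set are needed
// for this pause; all other cards are deferred.
RefinementPlan plan_refinement(const std::vector<ToyDirtyCard>& dirty,
                               const std::unordered_set<int>& collection_set) {
    RefinementPlan plan;
    for (const ToyDirtyCard& card : dirty) {
        const bool hits_cset = std::any_of(
            card.target_regions.begin(), card.target_regions.end(),
            [&](int region) { return collection_set.count(region) != 0; });
        (hits_cset ? plan.in_pause : plan.deferred).push_back(card);
    }
    return plan;
}
```

Note that each card still has to be scanned in the pause to decide
which side it falls on; what moves out of the pause is the remembered
set update for the deferred cards.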

An alternative or addition could be to work on decreasing the overhead
of the post-barrier by improving the compiler to decrease code size, and
to reconsider ideas for removing the StoreLoad, as suggested earlier (see
e.g.
http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2014-December/011666.html).

Thanks,
Erik