RFR: SATB compaction hides unmarked objects until final-mark

Tue Jun 19 13:44:50 UTC 2018

http://cr.openjdk.java.net/~shade/shenandoah/satb-prompt/webrev.02/

The current SATB filtering code tries to avoid enqueueing buffers from SATB barriers into the
global list if a buffer contains mostly uninteresting objects. In the G1/Shenandoah case, it
filters out already-marked objects, does two-finger compaction, and then decides whether to
return the buffer to the mutator to fill with more data. In many cases, it returns a pristine
buffer.
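
A minimal sketch of that filter-compact-decide step, for illustration only. The names Slot,
filter_and_compact, should_enqueue and the threshold_percent parameter are hypothetical and not
taken from the HotSpot sources; the real code works on raw buffer indices and differs in details.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a SATB buffer entry: an object id, 0 means "empty slot".
using Slot = int;

// Drops already-marked entries, then compacts the survivors to the front of the
// buffer with two fingers. Returns the number of retained entries.
std::size_t filter_and_compact(std::vector<Slot>& buf,
                               const std::function<bool(Slot)>& is_marked) {
  // Filter pass: already-marked objects are of no interest to the GC.
  for (Slot& s : buf) {
    if (s != 0 && is_marked(s)) s = 0;
  }
  // Two-finger pass: the left finger stops at empty slots, the right finger
  // walks down from the end looking for retained entries to move into them.
  std::size_t left = 0;
  std::size_t right = buf.size();
  while (left < right) {
    if (buf[left] != 0) { ++left; continue; }
    --right;
    if (buf[right] != 0) {
      buf[left] = buf[right];
      buf[right] = 0;
      ++left;
    }
  }
  assert(left <= buf.size());
  return left;
}

// Enqueue only if the retained share exceeds the (assumed) percentage threshold;
// otherwise the mutator keeps the buffer and continues filling it.
bool should_enqueue(std::size_t retained, std::size_t capacity,
                    std::size_t threshold_percent) {
  return retained * 100 > threshold_percent * capacity;
}
```

With five entries of which two are already marked, three entries are retained. Whether that is
enough to enqueue depends entirely on the threshold chosen.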

But this comes with an interesting caveat: if an unmarked object is surrounded by already-marked
objects that get filtered out all the time, there is a significant chance that the unmarked
object is never shown to the GC code. In Shenandoah, we would discover that object only
during final-mark, when we drain all SATB buffers regardless of filtering.

In some interesting workloads, that hidden object might be a large oop array, and scanning it
affects final-mark times. Also, even if the object is not very heavyweight, marking it eagerly
makes the subsequent filtering more efficient. Otherwise, there is a significant chance that we
touch the bitmaps on every filter-compact for objects that stay below the enqueueing threshold.

The way out of this is to cap the number of times we take the "not-enqueue" shortcut, and enqueue
the buffer when that cap is reached. I chose 50 taken shortcuts as the cap, which works well in
my experiments.
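
A minimal sketch of such a cap, under stated assumptions. ShortcutCappedQueue, its fields and
the threshold_percent parameter are hypothetical names for illustration. This is not the code
from the webrev, which should be consulted for the actual change.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread queue state; only the shortcut counter is modelled.
struct ShortcutCappedQueue {
  // 50 is the cap mentioned above; a real implementation may make it tunable.
  static constexpr std::size_t MaxShortcuts = 50;

  // Number of consecutive times the "not-enqueue" shortcut was taken.
  std::size_t shortcuts_taken = 0;

  // Returns true when the buffer should go to the global list.
  bool should_enqueue(std::size_t retained, std::size_t capacity,
                      std::size_t threshold_percent) {
    assert(retained <= capacity);
    bool above_threshold = retained * 100 > threshold_percent * capacity;
    if (above_threshold || shortcuts_taken >= MaxShortcuts) {
      // Either the buffer is full enough, or the shortcut has been taken too
      // often: enqueue, so that hidden unmarked objects reach the GC.
      shortcuts_taken = 0;
      return true;
    }
    // Take the shortcut and remember that we did.
    ++shortcuts_taken;
    return false;
  }
};
```

In this sketch a buffer that always stays below the threshold takes the shortcut 50 times and is
enqueued on the next attempt. The counter is a plain increment, so nothing expensive is added
to the filtering path.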

(My very first experiment used the time since the last enqueue as the threshold. That feels more
reliable, but it queries the time on the critical path, which is a potential scalability
bottleneck.)

For example, on one of our benchmarks:

Before:

  Pause Final Mark (G)    =  1.22 s (a =    10895 us)
  Pause Final Mark (N)    =  0.95 s (a =     8483 us)
    Finish Queues         =  0.84 s (a =     7458 us)
    Weak References       =  0.02 s (a =      739 us)
      Process             =  0.02 s (a =      733 us)
    Prepare Evacuation    =  0.06 s (a =      515 us)
    Initial Evacuation    =  0.03 s (a =      300 us)
      E: Thread Roots     =  0.02 s (a =      194 us)
      E: Code Cache Roots =  0.00 s (a =       43 us)

After:

  Pause Final Mark (G)    =  0.06 s (a =     2677 us)
  Pause Final Mark (N)    =  0.03 s (a =     1217 us)
    Finish Queues         =  0.01 s (a =      248 us)  <--- (1)
    Weak References       =  0.00 s (a =      361 us)  <--- (2)
      Process             =  0.00 s (a =      355 us)
    Prepare Evacuation    =  0.01 s (a =      491 us)
    Initial Evacuation    =  0.01 s (a =      365 us)
      E: Thread Roots     =  0.00 s (a =      182 us)
      E: Code Cache Roots =  0.00 s (a =       38 us)

(1): Significantly less final-mark queue work, because most hidden objects are now discovered
during concurrent mark.
(2): Apparently, concurrent precleaning works better, because more hidden objects got marked in
the concurrent phase, and concurrent precleaning then piggybacked on that.

Testing: tier3_gc_shenandoah, specjbb

Thanks,
-Aleksey