RFR: SATB compaction hides unmarked objects until final-mark
Roman Kennke
rkennke at redhat.com
Tue Jun 19 13:51:53 UTC 2018
Wow, very nice.
Patch looks ok.
We need to see and decide soon-ish what we do with our divergences in
SATB ptr queue. Fork it? Upstream the changes? But not now.
Roman
> http://cr.openjdk.java.net/~shade/shenandoah/satb-prompt/webrev.02/
>
> Current SATB filtering code strives to avoid enqueueing buffers from SATB barriers into the
> global list if the buffer contains a lot of non-interesting objects. In the G1/Shenandoah case, it
> filters out already-marked objects, does two-finger compaction, and then decides whether it wants to
> return the buffer to the mutator to fill with more data. In many cases, it returns the pristine
> buffer back.
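The filter-and-compact step described above can be sketched as follows. This is an illustrative model, not HotSpot's actual implementation: `is_marked` stands in for the real mark-bitmap query, and survivors are simply packed toward the end of the buffer while preserving order.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the mark-bitmap query; here, "marked"
// objects are those whose int payload is even.
static bool is_marked(void* obj) {
  return (*static_cast<int*>(obj) % 2) == 0;
}

// Sketch of SATB buffer filtering: drop already-marked entries and
// compact the survivors toward the end of the buffer. Live entries
// occupy [index, capacity) on entry; returns the new index, so that
// survivors occupy [new_index, capacity) on exit.
static size_t filter_compact(void** buf, size_t index, size_t capacity) {
  size_t dst = capacity;
  // Scan from the end toward the start so survivors can be packed in
  // place toward the end without clobbering unread slots.
  for (size_t src = capacity; src > index; ) {
    void* entry = buf[--src];
    if (!is_marked(entry)) {
      buf[--dst] = entry;  // retain: not yet marked, still interesting
    }
  }
  return dst;
}
```

If few entries survive, the caller can hand the mostly-empty buffer back to the mutator instead of enqueueing it; that is the shortcut the rest of the mail is about.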
>
> But this comes with an interesting caveat: if there is an unmarked object surrounded by
> already-marked objects that get filtered all the time, there is a significant chance that the
> unmarked object would never be shown to the GC code. In Shenandoah, we would discover that object
> only during final-mark, when we drain all SATB buffers regardless of filtering.
>
> In some interesting workloads, that hidden object might be a large oop array, and scanning it
> inflates final-mark times. Also, even if the object is not very heavy-weight, marking it eagerly
> makes the subsequent filtering more efficient. Otherwise, there is a significant chance that we
> would touch the mark bitmaps on every filter-compact for objects that stay below the enqueueing
> threshold.
>
> The way out of this is to cap the number of times we take the "not-enqueue" shortcut, and enqueue
> the buffer when that cap is reached. I chose 50 taken shortcuts as the threshold; it works well in
> my experiments.
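The capped shortcut can be sketched like this. The struct, its field names, and the half-full retainment threshold are all illustrative, not HotSpot's; only the cap of 50 comes from the mail. The point is that after enough consecutive shortcuts, the buffer is enqueued anyway, so entries hidden by repeated filtering eventually reach the collector during concurrent mark.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-queue sketch of the capped "not-enqueue" shortcut:
// a mostly-empty filtered buffer is normally handed back to the mutator,
// but only up to kMaxShortcuts consecutive times before it is enqueued
// to the global list anyway.
struct SatbQueueSketch {
  static const int kMaxShortcuts = 50;  // threshold chosen in the mail
  int shortcuts_taken = 0;

  // Decide what to do with a filtered buffer: true means "enqueue to
  // the global list", false means "return to mutator for refilling".
  bool should_enqueue(size_t retained, size_t capacity) {
    bool sparse = retained < capacity / 2;  // illustrative threshold
    if (sparse && shortcuts_taken < kMaxShortcuts) {
      shortcuts_taken++;   // take the not-enqueue shortcut once more
      return false;
    }
    shortcuts_taken = 0;   // enqueueing resets the shortcut counter
    return true;
  }
};
```

With this shape, a buffer whose few surviving entries keep it below the threshold still gets enqueued on the 51st attempt, bounding how long an unmarked object can stay invisible to the marking threads.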
>
> (My very first experiment used the time since the last enqueue as the threshold. That feels more
> reliable, but it queries the time on the critical path, and that was a potential scalability
> bottleneck.)
>
> For example, one of our benchmarks:
>
> Before:
>
> Pause Final Mark (G)      = 1.22 s (a = 10895 us)
> Pause Final Mark (N)      = 0.95 s (a =  8483 us)
>   Finish Queues           = 0.84 s (a =  7458 us)
>   Weak References         = 0.02 s (a =   739 us)
>     Process               = 0.02 s (a =   733 us)
>   Prepare Evacuation      = 0.06 s (a =   515 us)
>   Initial Evacuation      = 0.03 s (a =   300 us)
>     E: Thread Roots       = 0.02 s (a =   194 us)
>     E: Code Cache Roots   = 0.00 s (a =    43 us)
>
> After:
>
> Pause Final Mark (G)      = 0.06 s (a =  2677 us)
> Pause Final Mark (N)      = 0.03 s (a =  1217 us)
>   Finish Queues           = 0.01 s (a =   248 us) <--- (1)
>   Weak References         = 0.00 s (a =   361 us) <--- (2)
>     Process               = 0.00 s (a =   355 us)
>   Prepare Evacuation      = 0.01 s (a =   491 us)
>   Initial Evacuation      = 0.01 s (a =   365 us)
>     E: Thread Roots       = 0.00 s (a =   182 us)
>     E: Code Cache Roots   = 0.00 s (a =    38 us)
>
> (1): Significantly less final-mark queue work, because most hidden objects are now discovered
> during concurrent mark.
> (2): Apparently, concurrent precleaning works better, because more hidden objects got marked in the
> concurrent phase, and concurrent precleaning then piggybacked on that.
>
> Testing: tier3_gc_shenandoah, specjbb
>
> Thanks,
> -Aleksey
>