RFC: Throughput barriers for G1

Thu Nov 10 16:27:53 UTC 2016

On 11/09/2016 08:13 PM, Jungwoo Ha wrote:
> Wouldn't it be faster to just do card(o.x) = Dirty without the clean check?

Sure, I think that is what Thomas was trying to say. The only
requirement for the post-write barrier in this "mode" is to dirty a card
when it is needed. There is no correctness issue with dirtying more
cards than that. One could use multiple filters (such as checking
whether o and y are in different regions, whether y is NULL, etc.) or
just dirty the card unconditionally, whichever gives the best
performance (this might differ depending on workload).
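
To make the two extremes concrete, here is a minimal Python sketch of an
unconditional barrier next to a filtered one. This is a toy model, not
HotSpot code: the names (CardTable, barrier_unconditional,
barrier_filtered), the 512-byte card size, the 1 MB region size and the
Clean/Dirty encodings are all assumptions picked for illustration.

```python
CARD_SHIFT = 9             # assumed 512-byte cards
REGION_SHIFT = 20          # assumed 1 MB regions
CLEAN, DIRTY = 0xFF, 0x00  # illustrative encodings


class CardTable:
    """Toy card table: one byte per card, heap assumed to start at address 0."""

    def __init__(self, heap_bytes):
        self.cards = bytearray([CLEAN]) * (heap_bytes >> CARD_SHIFT)
        self.writes = 0  # number of stores made to the card table

    def dirty(self, addr):
        self.cards[addr >> CARD_SHIFT] = DIRTY
        self.writes += 1

    def is_dirty(self, addr):
        return self.cards[addr >> CARD_SHIFT] == DIRTY


def barrier_unconditional(ct, field_addr, new_value_addr):
    # Just dirty the card covering o.x: a single store, no loads, no branches.
    ct.dirty(field_addr)


def barrier_filtered(ct, field_addr, new_value_addr):
    # Filter 1: storing NULL never creates a cross-region pointer.
    if new_value_addr is None:
        return
    # Filter 2: pointers within the same region do not need to be recorded.
    if (field_addr >> REGION_SHIFT) == (new_value_addr >> REGION_SHIFT):
        return
    # Filter 3: avoid writing to a card that is already dirty.
    if not ct.is_dirty(field_addr):
        ct.dirty(field_addr)
```

Both variants dirty every card that needs to be dirty. The filtered
variant trades extra loads and branches for fewer card table writes,
and the unconditional variant does the opposite.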

The post-write barrier can be designed to be almost identical to CMS 
(CMS dirties the card for o, not o.x, IIRC). However, if you use CMS 
with -XX:+UseCondCardMark, then CMS also has a StoreLoad in its 
post-write barrier (G1 would not need that StoreLoad with this idea).

On 11/09/2016 08:13 PM, Jungwoo Ha wrote:
> Usually single store is faster than load & store on the same cache line.
> I don't think card(o.x) is preloaded in cache to make if check cheap.

There are also reasons for limiting the number of writes to the card 
table. For example, see 
https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card.
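
A rough back-of-the-envelope calculation shows why card table writes
from different threads can end up contending. The sketch below assumes
512-byte cards with one card table byte per card, 64-byte cache lines,
a heap starting at address 0 and a cache-line-aligned card table. All of
these are assumptions for illustration, and the function names are
hypothetical.

```python
CARD_SIZE_BYTES = 512    # assumed heap bytes covered by one card
CACHE_LINE_BYTES = 64    # assumed cache line size
CARD_SHIFT = CARD_SIZE_BYTES.bit_length() - 1  # 9 for 512-byte cards


def heap_bytes_per_card_table_line():
    # One cache line holds CACHE_LINE_BYTES card entries (one byte each),
    # and each entry covers CARD_SIZE_BYTES of heap.
    return CARD_SIZE_BYTES * CACHE_LINE_BYTES


def share_card_table_line(addr_a, addr_b):
    # True if the cards for both heap addresses live in the same cache line
    # of the card table, so stores to them can false-share.
    line_a = (addr_a >> CARD_SHIFT) // CACHE_LINE_BYTES
    line_b = (addr_b >> CARD_SHIFT) // CACHE_LINE_BYTES
    return line_a == line_b
```

Under these assumptions one cache line of the card table covers 32 KB
of heap, so two threads storing references into objects anywhere within
the same 32 KB window write to the same card table cache line.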

On 11/09/2016 08:13 PM, Jungwoo Ha wrote:
> Anyway this is a good step forward and I think it is a good tradeoff.
> However, pre & post is still heavier than the CMS wb, and we need to see
> the actual experimental numbers how close it can be.

You can start measuring without implementing the idea. The following
experiment will show you the cost of the pre-write barrier for your
workloads:
    1. Find an application that doesn't need mixed GCs (for example by
       increasing young gen size and max heap size). Alternatively you
       can just run any of your applications until OOME.
    2. Run the above application without generated pre-write barriers
       (and with concurrent mark and refinement turned off). This run
       becomes your baseline.
    3. Run the above application with the pre-write barriers, but in the
       slow leaf call into the VM, discard the old buffer (but create a
       new one). Turn off concurrent mark and concurrent refinement.
       This run becomes your target.
    If you now compare your baseline and your target, you should
    essentially see the impact of just the pre-write barrier.
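
The comparison itself can be as simple as the following sketch. It
assumes you collect a throughput score (higher is better) from several
runs of each configuration and compare the medians. The function name
and the sample numbers in the test are hypothetical, not measurements.

```python
from statistics import median


def relative_overhead(baseline_scores, target_scores):
    """Fraction of baseline throughput lost in the target configuration.

    Both arguments are lists of throughput scores (higher is better),
    one entry per run. Medians are used to dampen run-to-run noise.
    """
    base = median(baseline_scores)
    target = median(target_scores)
    return (base - target) / base
```

A result of 0.05 would mean the run with pre-write barriers achieved 5%
lower throughput than the baseline without them.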

Since the post-write barrier can be made almost identical to CMS (see my 
paragraph above), the overhead of the pre-write barrier would then be 
the barrier overhead for G1.

Would you guys at Google be willing to help out with running these 
experiments? At the last CMS meeting Jeremy said that Google would be 
willing to help out with G1 improvements IIRC.

Thanks,
Erik

> On Wed, Nov 9, 2016 at 7:54 AM, Thomas Schatzl
> <thomas.schatzl at oracle.com> wrote:
>
>     Hi all,
>
>       just one comment:
>
>     On Wed, 2016-11-09 at 16:11 +0100, Erik Helin wrote:
>     > Hi all,
>     >
>     > [...]
>     >
>     > The responsibility of the post-write barrier is to queue up pointers
>     > between regions so that the concurrent refinements threads can update
>     > the remembered sets concurrently.  If you were to give up this, an
>     > alternative post-write barrier could look like:
>     >
>     > if (y != NULL) {
>     >    if (card(o.x) == Clean) {
>     >        card(o.x) = Dirty;
>     >    }
>     > }
>     >
>
>     Actually, one can simply reuse all the existing post-write barrier code
>     generation and optimizations from any of the other collectors in the
>     simplest case.
>
>     I do not think parallel GC performs the NULL check :)
>
>     Thanks,
>       Thomas
>
> --
> Jungwoo Ha | Java Platform Team | jwha at google.com
>