RFC: Throughput barriers for G1

Erik Helin erik.helin at oracle.com
Wed Nov 16 15:53:29 UTC 2016


On 11/10/2016 09:04 PM, Jungwoo Ha wrote:
>
>     On 11/09/2016 08:13 PM, Jungwoo Ha wrote:
>
>         Usually a single store is faster than a load & store on the
>         same cache line.
>         I don't think card(o.x) is preloaded into the cache to make
>         the if check cheap.
>
>
>     There are also reasons for limiting the number of writes to the card
>     table. For example, see
>     https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
>
>
> A store goes to the store buffer, and marking the cache line with the
> M state happens in the background.
> If the mutator doesn't read the card table at all, the cache line
> will stay in the M state until the GC reads the cards (changing it to
> S), thus saving cache coherence traffic.
> A load will most likely transition the cache line to the S state,
> which is a high-latency load if the previous state was M, so the
> mutator ends up paying for long-latency loads.
> I am not sure there is any win with UseCondCardMark, at least on x86.
> Adding a branch also adds potential branch prediction overhead.
> You could probably use a cmov instruction, but that's also not as
> cheap as an ordinary mov.

Sure, adding a branch might (or might not) yield better results; I just
wanted to highlight that there is room for experimentation here.
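
To make the comparison concrete, the two post-write barrier variants
look roughly like this (sketched C++; card_table, card_shift and
dirty_card are illustrative names, not the exact HotSpot code):

    // Unconditional card mark: one store per reference write. Stores
    // from different threads to nearby cards can cause false sharing
    // on the card table cache line (the issue in the blog post above).
    card_table[(uintptr_t)obj >> card_shift] = dirty_card;

    // Conditional card mark (what -XX:+UseCondCardMark enables): load
    // and check first, so an already-dirty card is never stored to
    // again. This trades a load and a branch for fewer writes to the
    // card table.
    volatile char* card = &card_table[(uintptr_t)obj >> card_shift];
    if (*card != dirty_card) {
      *card = dirty_card;
    }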

On 11/10/2016 09:04 PM, Jungwoo Ha wrote:
>     You can start measuring without implementing the idea, the following
>     experiment will show you the cost of the pre-write barrier for your
>     workloads:
>        1. Find an application that doesn't need mixed GCs (for example by
>           increasing young gen size and max heap size). Alternatively you
>           can just run any of your applications until OOME.
>        2. Run the above application without generated pre-write barriers
>           (and concurrent mark and refinement turned off). This run
>           becomes your baseline.
>        3. Run the above benchmark with the pre-write barriers, but in the
>           slow leaf call into the VM, discard the old buffer (but create a
>           new one). Turn off concurrent mark and concurrent refinement.
>           This run becomes your target.
>        If you now compare your baseline and your target, you should
>        essentially see the impact of just the pre-write barrier.
>
>     Since the post-write barrier can be made almost identical to CMS
>     (see my paragraph above), the overhead of the pre-write barrier
>     would then be the barrier overhead for G1.
>
>     Would you guys at Google be willing to help out with running these
>     experiments? At the last CMS meeting Jeremy said that Google would
>     be willing to help out with G1 improvements IIRC.
>
>
> Sure, I can do the measurement with the DaCapo benchmark suite. I
> don't think we can run this experiment with the production workload.

Alright, thanks for helping out. It would also be very useful if you
have some internal benchmarks and/or applications that behave like the
workloads you are running today, just to get a clearer picture of the
potential improvement.
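
For reference, the pre-write barrier being measured is G1's SATB
(snapshot-at-the-beginning) barrier. In rough pseudocode (illustrative
names, not the exact HotSpot code) it does the following before every
reference store obj->field = new_val:

    if (satb_marking_active) {     // thread-local flag; false when
                                   // concurrent marking is not running
      oop pre_val = obj->field;    // value about to be overwritten
      if (pre_val != NULL) {
        // Record pre_val in the thread-local SATB buffer. When the
        // buffer fills up, a slow leaf call into the VM hands it off;
        // step 3 above would instead discard the buffer's contents
        // there and return an empty one.
        satb_enqueue(pre_val);
      }
    }
    obj->field = new_val;          // the actual store

Comparing the baseline and the target then isolates the mutator-side
cost of these loads, branches and enqueue operations.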

Thanks,
Erik


