<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

On 11/09/2016 08:13 PM, Jungwoo Ha wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Usually a single store is faster than a load and store on the same cache line.<br>

I don't think card(o.x) is preloaded into the cache to make the if check cheap.<br>

</blockquote>

<br></span>

There are also reasons for limiting the number of writes to the card table. For example, see <a href="https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card" rel="noreferrer" target="_blank">https://blogs.oracle.com/dave/<wbr>entry/false_sharing_induced_by<wbr>_card</a>.<span class=""><br>

<br></span></blockquote><div><br></div><div>A store goes to the store buffer, and moving the cache line to the M state happens in the background.</div><div>If the mutator doesn't read the card table at all, the cache line will stay in M until the GC reads the cards (changing it to S), thus saving cache coherence traffic.</div><div>A load will most likely move the cache line to the S state, which makes it a high-latency load if the previous state was M, and the mutator pays for those long-latency loads.</div><div>I am not sure there is any win with UseCondCardMark, at least on x86.<br></div><div>Adding a branch also adds potential branch-prediction overhead.</div><div>You can probably use a cmov instruction, but that's also not as cheap as an ordinary mov.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br></span>

You can start measuring without implementing the idea. The following experiment will show you the cost of the pre-write barrier for your workloads:<br>

   1. Find an application that doesn't need mixed GCs (for example by<br>

      increasing young gen size and max heap size). Alternatively you<br>

      can just run any of your applications until OOME.<br>

   2. Run the above application without generated pre-write barriers<br>

      (and concurrent mark and refinement turned off). This run<br>

      becomes your baseline.<br>

   3. Run the above benchmark with the pre-write barriers, but in the<br>

      slow leaf call into the VM, discard the old buffer (but create a<br>

      new one). Turn off concurrent mark and concurrent refinement.<br>

      This run becomes your target.<br>

   If you now compare your baseline and your target, you should<br>

   essentially see the impact of just the pre-write barrier.<br>

<br>

Since the post-write barrier can be made almost identical to CMS (see my paragraph above), the overhead of the pre-write barrier would then be the barrier overhead for G1.<br>

<br>

Would you guys at Google be willing to help out with running these experiments? At the last CMS meeting Jeremy said that Google would be willing to help out with G1 improvements IIRC.<br></blockquote><div><br></div><div>Sure, I can do the measurement with the DaCapo benchmark suite. I don't think we can run this experiment on production workloads.</div><div><br></div></div>
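<div>To make the two post-write barrier variants discussed above concrete, here is a minimal Java sketch of an unconditional card mark versus the conditional one (the load, compare, and branch that UseCondCardMark adds). It is only a model, not HotSpot's generated code: the class name, the 512-byte card size, the clean/dirty byte values, and the <code>stores</code> counter are all assumptions made for illustration.</div>

```java
// Hypothetical model of a card table; not HotSpot's actual implementation.
final class CardMarkSketch {
    static final int CARD_SHIFT = 9;   // assumed: one card covers 512 bytes
    static final byte DIRTY = 0;       // assumed encoding of a dirty card
    static final byte CLEAN = -1;      // assumed encoding of a clean card

    final byte[] cards;
    int stores;                        // counts writes to the card table (illustration only)

    CardMarkSketch(int heapBytes) {
        cards = new byte[(heapBytes >>> CARD_SHIFT) + 1];
        java.util.Arrays.fill(cards, CLEAN);
    }

    // Unconditional barrier: a single store, no load and no branch.
    void markUnconditional(int addr) {
        cards[addr >>> CARD_SHIFT] = DIRTY;
        stores++;
    }

    // Conditional barrier: load the card first and store only if it is
    // not already dirty. This limits writes to the card table but adds
    // a load and a branch to every reference store.
    void markConditional(int addr) {
        int index = addr >>> CARD_SHIFT;
        if (cards[index] != DIRTY) {
            cards[index] = DIRTY;
            stores++;
        }
    }
}
```

<div>For two reference writes that fall into the same card, the unconditional variant writes the card twice while the conditional one writes it once. Whether the saved write is worth the extra load and branch is exactly what would need to be measured.</div>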

</div></div>