RFC: Throughput barriers for G1
Erik Helin
erik.helin at oracle.com
Wed Nov 16 15:53:29 UTC 2016
On 11/10/2016 09:04 PM, Jungwoo Ha wrote:
>
> On 11/09/2016 08:13 PM, Jungwoo Ha wrote:
>
> Usually a single store is faster than a load & store on the same
> cache line. I don't think card(o.x) is preloaded into the cache to
> make the if check cheap.
>
>
> There are also reasons for limiting the number of writes to the card
> table. For example, see
> https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card.
>
>
> A store goes to the store buffer, and marking the cache line with
> the M state is done in the background.
> If the mutator doesn't read the card table at all, the cache line
> will stay M until the GC reads the cards (changing it to S), thus
> saving cache coherence traffic.
> A load will most likely transition the cache line to the S state,
> which is a high-latency load if the previous state is M, so the
> mutator pays for the long-latency loads.
> I am not sure there is any win with UseCondCardMark, at least on x86.
> Adding a branch also adds potential branch prediction overhead.
> You could probably use a cmov instruction, but that is not as cheap
> as an ordinary mov either.
Sure, adding a branch might or might not yield better results; I just
wanted to highlight that there is room for experimentation here.
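To make the two variants concrete, here is a minimal C++ sketch of an
unconditional vs. a conditional card mark. The names, table size, and
dirty value are illustrative assumptions, not the actual HotSpot code
(which also biases the table base, among other things):

    #include <cstdint>

    const int     kCardShift = 9;        // 512-byte cards
    const uint8_t kDirtyCard = 0;
    static uint8_t card_table[1u << 20]; // one byte per card

    // Unconditional card mark: a single store, every time.
    void post_barrier(void* field) {
      card_table[(uintptr_t)field >> kCardShift] = kDirtyCard;
    }

    // Conditional card mark (-XX:+UseCondCardMark): an extra load and
    // branch skip the store when the card is already dirty, so a hot,
    // shared card line is not written over and over by many threads.
    void post_barrier_cond(void* field) {
      uint8_t* card = &card_table[(uintptr_t)field >> kCardShift];
      if (*card != kDirtyCard) {
        *card = kDirtyCard;
      }
    }

Whether the saved stores pay for the extra loads and branches is
exactly the kind of question the experiments below should answer.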
On 11/10/2016 09:04 PM, Jungwoo Ha wrote:
> You can start measuring without implementing the idea; the following
> experiment will show you the cost of the pre-write barrier for your
> workloads:
> 1. Find an application that doesn't need mixed GCs (for example by
> increasing young gen size and max heap size). Alternatively, you
> can just run any of your applications until OOME.
> 2. Run the above application without generated pre-write barriers
> (and concurrent mark and refinement turned off). This run
> becomes your baseline.
> 3. Run the above benchmark with the pre-write barriers, but in the
> slow leaf call into the VM, discard the old buffer (but create a
> new one). Turn off concurrent mark and concurrent refinement.
> This run becomes your target.
> If you now compare your baseline and your target, you should
> essentially see the impact of just the pre-write barrier.
>
> Since the post-write barrier can be made almost identical to CMS's
> (see my paragraph above), the overhead of the pre-write barrier
> would then be G1's additional barrier overhead compared to CMS.
>
> Would you guys at Google be willing to help out with running these
> experiments? At the last CMS meeting Jeremy said that Google would
> be willing to help out with G1 improvements IIRC.
>
>
> Sure, I can do the measurement with the DaCapo benchmark suite. I
> don't think we can run this experiment with the production workload.
Alright, thanks for helping out. It would also be very useful if you
have some internal benchmarks and/or applications that behave like the
workloads you are running today, just to get a clear picture of the
potential improvement.
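To make step 3 above concrete, the SATB pre-write barrier fast path
looks roughly like the C++ sketch below. The identifiers, buffer size,
and stub are illustrative assumptions, not HotSpot's actual code; the
point is only to show where the experiment would discard the filled
buffer instead of handing it to the collector:

    #include <cstddef>

    typedef void* oop;  // stand-in for a Java object reference

    const size_t kBufSize = 1024;
    thread_local oop    satb_buffer[kBufSize];
    thread_local size_t satb_index = kBufSize;  // filled top-down
    bool satb_marking_active = false;  // set during concurrent mark

    void satb_flush_buffer() {
      // Slow leaf call into the VM: hand the full thread-local buffer
      // to the collector (elided). Step 3 of the experiment would
      // simply throw the buffer's contents away here instead.
    }

    // Pre-write barrier, executed before: *field = new_val;
    void pre_barrier(oop* field) {
      if (satb_marking_active) {       // only during concurrent mark
        oop pre_val = *field;          // value about to be overwritten
        if (pre_val != nullptr) {
          if (satb_index == 0) {
            satb_flush_buffer();
            satb_index = kBufSize;     // start a fresh buffer
          }
          satb_buffer[--satb_index] = pre_val;  // fast path: enqueue
        }
      }
    }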
Thanks,
Erik
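P.S. For step 1, flag settings along these lines might be a starting
point (the sizes are placeholders to tune per workload, and note that
explicitly sizing the young gen turns off G1's adaptive young gen
sizing):

    java -XX:+UseG1GC -Xms32g -Xmx32g \
         -XX:NewSize=28g -XX:MaxNewSize=28g \
         -XX:+PrintGCDetails MyBenchmark

Here MyBenchmark is just a placeholder for whatever application you
pick.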