Removing G1 Reference Post Write Barrier StoreLoad Barrier

Thomas Schatzl thomas.schatzl at oracle.com
Mon Dec 22 22:19:53 UTC 2014


Hi,

On Mon, 2014-12-22 at 20:30 +0000, Erik Österlund wrote:
> Hi Thomas,
> 
> My assumption is more about fast/slow code paths than it is about
> fast/slow threads. 

Fast/slow threads was what I have been thinking of. If mutators are
picking up work, and are probably going to do most of the work, there
are no distinct slow/fast threads.

> And reference writes are something I consider a fast path. Although
> the frequency of inter-regional pointer writes differs between
> applications, I think that having a StoreLoad fence in this G1
> barrier gives rise to some awkward cases, like sorting large linked
> lists, where performance becomes suboptimal, so it would be neat to
> get rid of it and get more consistent and resilient performance
> numbers.

Sorting linked lists is suboptimal with the current G1 with or without
the change, as every reference write potentially creates a card entry.
I guess most time will be spent in the actual refinement in this case
anyway.
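
For reference, the barrier under discussion looks roughly like this (a
simplified sketch in C++-like pseudocode, not the exact HotSpot code;
names such as LogOfHRGrainBytes and enqueue_dirty_card are only meant
to be illustrative):

  // Executed after the store  *field = new_val
  void g1_post_write_barrier(oop* field, oop new_val) {
    // Same-region and NULL stores need no remembered set update.
    if ((((uintptr_t)field ^ (uintptr_t)new_val) >> LogOfHRGrainBytes) == 0)
      return;
    if (new_val == NULL)
      return;
    volatile jbyte* card = card_table_base + ((uintptr_t)field >> card_shift);
    if (*card == g1_young_card_val)
      return;                      // stores into young regions are filtered
    StoreLoad();                   // <-- the fence this proposal removes
    if (*card == dirty_card_val)
      return;                      // already dirtied by somebody else
    *card = dirty_card_val;
    enqueue_dirty_card(card);      // queued for (concurrent) refinement
  }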

> With that being said, the local cost of issuing this global fence
> (~640 nanoseconds on my machine with my implementation based on
> mprotect, which seems the most portable) is amortised away for both
> concurrent refinement threads and mutators alike, since they both
> buffer cards to be processed and can batch them to amortise the cost.
> I currently batch 128 cards at a time, and the cost of the global
> fence seems to have vanished.

A quick Google search indicates that e.g. SPARC M7 systems may have up
to 1024 cores with 4096 threads (a very extreme example). Larger Intel
systems may also have 100+ threads, and current two-socket Intel
systems reach 32+ threads.

Mprotect will flush the store buffers of all processors every time, so
you charge everyone (and not only this VM; consider running multiple
VMs on a system). This is what Jon has been concerned about: how
scalable this is.

There is a chance that a (potentially much more local) StoreLoad is,
overall, much less expensive than mprotect on a modestly large system.
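
For reference, I assume the mprotect based fence does something like
the following (my own sketch, not your code): downgrading and then
restoring the protection of a dedicated page forces the kernel to send
TLB shootdown IPIs to every processor currently running a thread of the
process, and handling that interrupt serializes those processors, which
is where the store buffer flush comes from.

  #include <sys/mman.h>
  #include <unistd.h>

  static void*  fence_page;
  static size_t page_size;

  void global_fence_init() {
    page_size  = (size_t)sysconf(_SC_PAGESIZE);
    fence_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }

  // Called by a thread that wants all other threads' earlier stores to
  // become visible before it continues.
  void global_fence() {
    mprotect(fence_page, page_size, PROT_NONE);               // IPIs to all CPUs
    mprotect(fence_page, page_size, PROT_READ | PROT_WRITE);  // restore access
  }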

> If I understand you correctly, the frequency of invoking card
> refinement from mutators might have to increase by giving them
> smaller dirty card buffers, because we can’t have too many dirty
> cards hanging around per mutator thread if we want good latency
> with lots of threads?

This is one problem; the other is that mutator threads themselves need
to do more refinement than they do with current settings, which means
more full buffers and more frequent mprotect calls. There may be some
possibility to increase the buffer size, but 128 cards already seems
somewhat large (I have not measured it and do not even know the current
buffer size).

DaCapo is unfortunately not a particularly good benchmark for large
systems. Even h2 runs very well with a few hundred MBs of heap and is
very short.

> In that case, the minimum size of
> mutator dirty card buffers could impact the batch size, so the
> constants matter here. But 128 seems like a rather small constant;
> do you think we would run into a situation where that matters?

No, not particularly in that situation; see above. 128 cards may
already be quite a lot of work to do, though.
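
To make the interaction concrete, I picture it roughly like this (again
only a sketch with made-up names, reusing the global_fence() from
above): a mutator fills a per-thread buffer of dirty cards, and whoever
ends up refining a full buffer pays for one global fence per batch
instead of one StoreLoad per reference write.

  typedef signed char jbyte;              // as in the JDK's jni_md.h

  void global_fence();                    // the mprotect sketch from above
  void refine_card(volatile jbyte* card); // hypothetical: actual refinement work

  const size_t BufferSize = 128;          // the batch size you mention

  // One global fence, amortized over the whole batch.
  void refine_batch(volatile jbyte** cards, size_t n) {
    global_fence();                       // ~640 ns, paid once per batch
    for (size_t i = 0; i < n; i++)
      refine_card(cards[i]);              // scan card, update remembered sets
  }

  struct DirtyCardQueue {                 // one per mutator thread
    volatile jbyte* buf[BufferSize];
    size_t index = 0;

    void enqueue(volatile jbyte* card) {
      buf[index++] = card;
      if (index == BufferSize) {
        refine_batch(buf, index);         // or hand off to a refinement thread
        index = 0;
      }
    }
  };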

> Personally I think that if somebody has a billion threads, doesn’t
> do anything other than inter-regional pointer writes, and at the
> same time expects flawless latency, then perhaps they should rethink
> what they are doing haha!

:)

> Hmm or maybe a VM flag could let the user choose
> if they have weird specific requirements? UseMembar seems to already
> be used for situations like these.

I think we will give your changes a few tries, at least run them
through a few tests.

Thanks,
  Thomas
 



