Removing G1 Reference Post Write Barrier StoreLoad Barrier

Erik Österlund erik.osterlund at lnu.se
Mon Dec 22 20:30:00 UTC 2014


Hi Thomas,

My assumption is more about fast/slow code paths than about fast/slow threads, and reference writes are something I consider a fast path. Although the frequency of inter-regional pointer writes varies between applications, I think the StoreLoad fence in this G1 barrier gives rise to awkward cases, such as sorting large linked lists, where performance becomes suboptimal. It would be neat to get rid of it and get more consistent and resilient performance numbers.
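For reference, here is a minimal sketch of the barrier shape I am talking about (simplified, with illustrative helper names rather than the exact HotSpot code; OrderAccess::storeload() is the fence in question):

    // Post-write barrier for "*field = new_val", simplified.
    void g1_post_write_barrier(oop* field, oop new_val) {
      // Filter: only inter-regional, non-NULL stores need a card.
      if (new_val != NULL && crosses_regions(field, new_val)) {
        volatile jbyte* card = card_for(field);
        if (*card != g1_young_card_val) {
          OrderAccess::storeload();    // <-- the fence to remove
          if (*card != dirty_card_val) {
            *card = dirty_card_val;
            enqueue_dirty_card(card);  // thread-local dirty card queue
          }
        }
      }
    }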

With that being said, the local cost of issuing this global fence (~640 nanoseconds on my machine, with my implementation based on mprotect, which seems the most portable approach) is amortised away for concurrent refinement threads and mutators alike, since both buffer cards to be processed and can batch them. I currently batch 128 cards at a time, and the cost of the global fence seems to have vanished.
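As a rough sketch of the mprotect trick (toggling the protection of a dedicated page forces the kernel to interrupt every CPU that may cache the old translation, which serialises pending stores on those CPUs' behalf; the exact guarantee is OS-specific, so treat this as an assumption rather than a specification):

    #include <sys/mman.h>

    static void* fence_page;  // one dedicated page, mapped at startup

    void global_fence_init() {
      fence_page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    // Only issued on the slow path, e.g. once per batch of cards.
    void global_fence() {
      // Flipping the protection triggers a TLB shootdown IPI on
      // all processors, acting as a StoreLoad fence on behalf of
      // every other thread, with no fence in the mutator fast path.
      mprotect(fence_page, 4096, PROT_READ);
      mprotect(fence_page, 4096, PROT_READ | PROT_WRITE);
    }

This is essentially the same serialisation-page idea HotSpot already uses for thread state transitions when UseMembar is off.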

If I understand you correctly, the frequency of invoking card refinement from mutators might have to increase, by giving them smaller dirty card buffers, because we cannot have too many dirty cards hanging around per mutator thread if we want good latency with lots of threads. In that case the minimum size of the mutator dirty card buffers could bound the batch size, so the constants matter here. But 128 seems like a rather small constant; do you think we would run into a situation where that matters? Personally, I think that if somebody has a billion threads, does nothing but inter-regional pointer writes, and at the same time expects flawless latency, then perhaps they should rethink what they are doing! Or a VM flag could let the user choose, if they have unusual specific requirements; UseMembar already seems to be used for situations like these.
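To make the amortisation concrete (constants from above, helper names hypothetical): at ~640 ns per global fence and 128 cards per batch, the fence adds roughly 5 ns per card, which is why it disappears in the noise:

    const int BATCH_SIZE = 128;  // cards per global fence

    void refine_batch(jbyte** cards, int n) {
      // One global fence covers the whole batch: after it returns,
      // the reference stores guarded by all n cards are visible to
      // this refinement thread.
      global_fence();
      for (int i = 0; i < n; i++) {
        refine_card(cards[i]);  // hypothetical per-card processing
      }
    }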

Thank you for a fruitful discussion.

/Erik

> On 22 Dec 2014, at 18:31, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> 
> Hi,
> 
> On Mon, 2014-12-22 at 18:52 +0100, Thomas Schatzl wrote:
>> Hi,
>> 
>> On Mon, 2014-12-22 at 09:09 -0800, Jon Masamitsu wrote:
>>> Erik,
>>> 
>>> One concern regarding the use of asymmetric Dekker synchronization
>>> (ADS) is how well this technique scales to 1000's of threads.  Do you 
>>> have an
>>> implementation where you can measure the scalability?
>> 
>>  another potential problem is that mutator threads might (and already
>> do in some workloads) also help with refinement.
>> 
> 
> ^^^ which behavior at least partially invalidates your assumptions about
> having "fast" and "slow" threads.
> 
> Thanks,
>  Thomas