RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
Erik Österlund
erik.osterlund at lnu.se
Tue May 12 12:08:42 UTC 2015
Hi Mikael and Vitaly,
Yeah, G1 skips the storeload for writes into young regions, and also for pointers within the same region (which are probably pretty common).
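Roughly, the G1 post-barrier filtering looks something like this (a simplified
sketch with illustrative names, not the actual HotSpot code):

    // after the reference store: field = new_val
    if ((((uintptr_t)field ^ (uintptr_t)new_val) >> region_shift) == 0) return; // same region
    if (new_val == NULL) return;
    volatile jbyte* card = &card_table[(uintptr_t)field >> card_shift];
    if (*card == g1_young_card_val) return;  // store into a young region: no storeload
    storeload_fence();                       // only cross-region, non-young stores pay this
    if (*card != dirty_card_val) {
      *card = dirty_card_val;
      enqueue(card);                         // hand the card to refinement
    }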
Just to clear things up - it seems like my approach might be interesting here. Would anyone volunteer to help out and do some benchmarking if I send a patch?
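For reference, the UseCondCardMark pattern in question is roughly the following
(a simplified sketch with illustrative names, not the actual HotSpot barrier
code): the mutator stores the reference and then loads the card to decide
whether it needs dirtying, and with CMS precleaning running concurrently a
StoreLoad is needed between the two so the precleaner cannot clean the card
and miss the store:

    obj->field = new_val;                    // the reference store
    storeload_fence();                       // the per-store fence under discussion
    volatile jbyte* card = &card_table[(uintptr_t)&obj->field >> card_shift];
    if (*card != dirty_card_val) {
      *card = dirty_card_val;                // only dirty the card if needed
    }

The idea would be to drop that per-store fence and let the precleaning side
issue a global fence instead.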
Cheers,
/Erik
> On 12 May 2015, at 08:59, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
>
> Vitaly,
>
> On 2015-05-12 02:54, Vitaly Davidovich wrote:
>> Erik,
>>
>> Thanks for the explanation - this is a clever trick! :)
>>
>> Out of curiosity, was there an explanation/theory why this didn't matter
>> for G1? Are most write barriers there eliminated via some other means?
>
> The G1 write barrier has a conditional check for writes to objects in young regions and elides the storeload barrier for those.
>
> /Mikael
>
>>
>> sent from my phone
>> On May 11, 2015 12:51 PM, "Erik Österlund" <erik.osterlund at lnu.se> wrote:
>>
>>> Hi Andrew,
>>>
>>>> On 11 May 2015, at 17:21, Andrew Haley <aph at redhat.com> wrote:
>>>>
>>>> On 05/11/2015 05:06 PM, Vitaly Davidovich wrote:
>>>>
>>>>>> Also the global operation is not purely, but “mostly” locally expensive
>>>>>> for the thread performing the global fence. The cost on remote CPUs is
>>>>>> pretty much simply a normal fence (roughly). Of course there is always
>>>>>> gonna be that one guy with 4000 CPUs which might be a bit awkward.
>>>>
>>>> Well yes, but that guy with 4000 CPUs is precisely the target for
>>>> UseCondCardMark.
>>>
>>> Okay. That should still be fine as I described, but a bit expensive to
>>> benchmark and fine-tune, I guess. I don’t have access to any such
>>> machines. :( If somebody does, we could find out.
>>>
>>>>
>>>>>> But even then, with high enough n, shared, timestamped global
>>>>>> fences etc, even such ridiculous scalability should be within
>>>>>> reach.
>>>>>
>>>>> Is it roughly like a normal fence for remote CPUs?
>>>>
>>>> I would not think so. Surely you'd have to interrupt every core in
>>>> the process and do a bunch of flushes. A TLB flush is expensive, as
>>>> is interrupting the core itself. I'm fairly sure there's no way to
>>>> flush a remote core's TLB without interrupting it.
>>>>
>>>
>>> Yes, but in a round-robin fashion using e.g. the APIC on x86, not necessarily
>>> all globally at the same time. It’s like message passing. And the TLBs will
>>> only be purged for the range of the memory protection; this is a single
>>> page that those remote CPUs don’t even have in their TLB caches, and
>>> therefore no remote TLB caches will be changed.
>>>
>>> On e.g. x86_64, the APIC message itself will fence, and then it will run
>>> the code to find out that no TLB entries need changing, and that’s pretty
>>> much it.
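>>>
>>> A rough sketch of that kind of global fence (POSIX, illustrative names,
>>> simplified, not the exact code): keep one dedicated page resident and issue
>>> the fence by downgrading and restoring its protection, which makes the
>>> kernel send the shootdown IPIs described above:
>>>
>>>   #include <sys/mman.h>
>>>   #include <unistd.h>
>>>
>>>   static void*  fence_page;
>>>   static size_t page_size;
>>>
>>>   void global_fence_init() {
>>>     page_size  = (size_t)sysconf(_SC_PAGESIZE);
>>>     fence_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
>>>                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>     *(volatile char*)fence_page = 0;   /* fault the page in so a PTE exists */
>>>   }
>>>
>>>   void global_fence() {
>>>     /* Downgrading the protection forces a TLB shootdown: the kernel IPIs
>>>        the CPUs running this process, they serialize, find nothing to purge
>>>        for this page, and return. The heavy cost stays with this thread. */
>>>     mprotect(fence_page, page_size, PROT_READ);
>>>     mprotect(fence_page, page_size, PROT_READ | PROT_WRITE);
>>>     *(volatile char*)fence_page = 0;   /* keep the page resident for next time */
>>>   }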
>>>
>>> This is not a scalability bottleneck at all, and I already know the constant
>>> costs are not problematic: I use this technique quite a lot myself, and
>>> Thomas Schatzl was kind enough to thoroughly benchmark such a card cleaning
>>> solution for me on G1 around new year, on a number of benchmarks and
>>> machines. The conclusion for G1 was that it didn’t matter performance-wise.
>>> Also, that constant cost can be amortized away arbitrarily by regulating its
>>> frequency.
>>>
>>> Thanks,
>>> /Erik
>>>
>>>> Andrew.
>>>
>>>