RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning

Mikael Gerdin mikael.gerdin at oracle.com
Tue May 12 07:59:01 UTC 2015


Vitaly,

On 2015-05-12 02:54, Vitaly Davidovich wrote:
> Erik,
>
> Thanks for the explanation - this is a clever trick! :)
>
> Out of curiosity, was there an explanation/theory why this didn't matter
> for G1? Are most write barriers there eliminated via some other means?

The G1 write barrier has a conditional check for writes to objects in 
young regions and elides the StoreLoad barrier for those writes.
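
Roughly, the filtering looks like this (an illustrative C sketch, not 
the actual HotSpot code; all names and card values are made-up 
stand-ins for the real ones):

  #include <stddef.h>
  #include <stdint.h>

  extern volatile char *card_table;
  #define CARD_SHIFT 9               /* 512-byte cards */
  #define YOUNG_CARD 2               /* illustrative card values */
  #define DIRTY_CARD 0
  int  same_region(void *a, void *b);
  void storeload_fence(void);
  void enqueue_for_refinement(volatile char *card);

  void g1_post_write_barrier(void *field_addr, void *new_val) {
    if (new_val == NULL) return;                  /* no pointer written */
    if (same_region(field_addr, new_val)) return; /* nothing to remember */
    volatile char *card = &card_table[(uintptr_t)field_addr >> CARD_SHIFT];
    if (*card == YOUNG_CARD) return;  /* young-region write: fence elided */
    storeload_fence();                /* only old-region stores pay this */
    if (*card != DIRTY_CARD) {
      *card = DIRTY_CARD;
      enqueue_for_refinement(card);   /* concurrent refinement handles it */
    }
  }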

/Mikael

>
> sent from my phone
> On May 11, 2015 12:51 PM, "Erik Österlund" <erik.osterlund at lnu.se> wrote:
>
>> Hi Andrew,
>>
>>> On 11 May 2015, at 17:21, Andrew Haley <aph at redhat.com> wrote:
>>>
>>> On 05/11/2015 05:06 PM, Vitaly Davidovich wrote:
>>>
>>>>> Also, the global operation is not purely, but “mostly”, locally
>>>>> expensive for the thread performing the global fence. The cost on
>>>>> remote CPUs is roughly just that of a normal fence. Of course there is
>>>>> always gonna be that one guy with 4000 CPUs, which might be a bit
>>>>> awkward.
>>>
>>> Well yes, but that guy with 4000 CPUs is precisely the target for
>>> UseCondCardMark.
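>>>
>>> For concreteness, the difference is roughly this (an illustrative
>>> sketch reusing the made-up names from above, not the actual HotSpot
>>> code):
>>>
>>>   /* Plain card mark: every reference store dirties the card, so on
>>>      big machines many CPUs keep writing to the same card-table
>>>      cache lines and ping-pong them around. */
>>>   card_table[(uintptr_t)addr >> CARD_SHIFT] = DIRTY_CARD;
>>>
>>>   /* UseCondCardMark: read first, write only when needed, so the
>>>      line usually stays shared. That read is exactly what races
>>>      with CMS precleaning, and why the reference store needs a
>>>      StoreLoad fence before the card is examined. */
>>>   if (card_table[(uintptr_t)addr >> CARD_SHIFT] != DIRTY_CARD)
>>>     card_table[(uintptr_t)addr >> CARD_SHIFT] = DIRTY_CARD;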
>>
>> Okay. That should still be fine, as I described, but a bit expensive to
>> benchmark and fine-tune, I guess. I don’t have access to any such
>> machines. :( If somebody does, we could find out.
>>
>>>
>>>>> But even then, with high enough n, shared, timestamped global
>>>>> fences etc, even such ridiculous scalability should be within
>>>>> reach.
>>>>
>>>> Is it roughly like a normal fence for remote CPUs?
>>>
>>> I would not think so.  Surely you'd have to interrupt every core in
>>> the process and do a bunch of flushes.  A TLB flush is expensive, as
>>> is interrupting the core itself.  I'm fairly sure there's no way to
>>> flush a remote core's TLB without interrupting it.
>>>
>>
>> Yes, but in a round-robin fashion using e.g. the APIC on x86, not
>> necessarily all globally at the same time. It’s like message passing. And
>> the TLBs will only be purged for the range covered by the protection
>> change; that is a single page which those remote CPUs don’t even have in
>> their TLBs, so no remote TLB entries will actually change.
>>
>> On e.g. x86_64, the APIC message itself will fence, and the handler will
>> then run only to find that no TLB entries need changing; that’s pretty
>> much it.
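>>
>> Concretely, the trick is something like the following (a minimal sketch
>> assuming Linux and 4K pages; the names are mine, not HotSpot code):
>>
>>   #include <assert.h>
>>   #include <sys/mman.h>
>>
>>   static volatile char *dummy_page;  /* one page nobody else uses */
>>
>>   void global_fence_init(void) {
>>     dummy_page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>>                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>     assert(dummy_page != MAP_FAILED);
>>     *dummy_page = 0;  /* fault it in: gives mprotect a PTE to shoot down */
>>   }
>>
>>   /* Serializes every CPU running a thread of this process: revoking
>>      access forces a TLB shootdown, and the shootdown IPI acts as a
>>      fence on each remote CPU. No remote CPU has this page in its TLB,
>>      so no remote entry actually changes. */
>>   void global_fence(void) {
>>     mprotect((void *)dummy_page, 4096, PROT_NONE);
>>     mprotect((void *)dummy_page, 4096, PROT_READ | PROT_WRITE);
>>   }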
>>
>> This is not a scalability bottleneck at all, and I already know the
>> constant costs are not problematic: I use this technique quite a lot
>> myself, and Thomas Schatzl was kind enough to thoroughly benchmark such a
>> card-cleaning solution for me on G1 around new year, on a number of
>> benchmarks and machines. The conclusion for G1 was that it didn’t matter
>> performance-wise. Also, that constant cost can be amortized away
>> arbitrarily by regulating how often the global fence is issued.
>>
>> Thanks,
>> /Erik
>>
>>> Andrew.
>>
>>


